Abstract

This  paper  presents  a  novel  reinforcement  learning  algorithm  and  provides  conditions  for  global 
convergence  to  Nash  equilibria.  For  several  classes  of  reinforcement  learning  schemes,  including  the 
ones  proposed  here,  excluding  convergence  to  action  profiles  which  are  not  Nash  equilibria  may  not  be 
trivial,  unless  the  step-size  sequence  is  appropriately  tailored  to  the  specifics  of  the  game.  In  this  paper, 
we  sidestep  these  issues  by  introducing  a  perturbed  reinforcement  learning  scheme  where  the  strategy  of 
each  agent  is  perturbed  by  a  strategy-dependent  perturbation  (or  mutations)  function.  Contrary  to  prior 
work  on  equilibrium  selection  in  games  where  perturbation  functions  are  globally  state  dependent,  the 
perturbation  function  here  is  assumed  to  be  local,  i.e.,  it  only  depends  on  the  strategy  of  each  agent. 
We  provide  conditions  under  which  the  strategies  of  the  agents  will  converge  to  an  arbitrarily  small 
neighborhood  of  the  set  of  Nash  equilibria  almost  surely.  This  extends  prior  analysis  on  reinforcement 
learning  in  games  which  has  been  primarily  focused  on  urn  processes.  We  finally  specialize  the  results 
to  a  class  of  potential  games. 


1  Introduction 

Lately,  agent-based  modeling  has  generated  significant  interest  in  various  settings,  such  as  engineering, 
social  sciences  and  economics.  In  those  formulations,  agents  make  decisions  independently  and  without 
knowledge  of  the  actions  or  intentions  of  the  other  agents.  Usually,  the  interactions  among  agents  can  be 
described  in  terms  of  a  strategic-form  game,  and  stability  notions,  such  as  the  Nash  equilibrium,  can  be 
utilized  to  describe  desirable  outcomes  for  all  agents. 

In this paper, we are interested in deriving conditions under which agents learn to play Nash equilibria. Assuming minimum information available to each agent, namely its own utility and actions, we introduce a novel reinforcement learning scheme and derive conditions under which global convergence to Nash equilibria can be achieved.

*This work was supported by the Swedish Research Council through the Linnaeus Center LCCC and the AFOSR MURI project #FA9550-09-1-0538.

†G.C. Chasparis is with the Software Competence Center, Hagenberg, Austria, gchasparis@gmail.com. http://www.chasparis.blogspot.gr

‡J.S. Shamma is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, shamma@gatech.edu, www.prism.gatech.edu/~jshamma3

§A. Rantzer is with the Department of Automatic Control, Lund University, Lund, Sweden, rantzer@control.lth.se, www.control.lth.se/Staff/anders_rantzer.html

Report Documentation Page (Standard Form 298, Rev. 8-98; OMB No. 0704-0188)

Report date: 07 DEC 2012
Dates covered: 00-00-2012 to 00-00-2012
Title: Nonconvergence to Saddle Boundary Points under Perturbed Reinforcement Learning
Performing organization: Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, GA, 30332
Distribution/availability statement: Approved for public release; distribution unlimited
Supplementary notes: submitted for journal publication
Security classification (report, abstract, this page): unclassified
Limitation of abstract: Same as Report (SAR)
Number of pages: 30

In reinforcement learning schemes, agents build their confidence over an action through repeated selection of this action and proportionally to the reward received from this action. In particular, each agent updates a probability distribution over its available actions, and the probability of selecting an action increases whenever this action is selected and proportionally to the reward received. This class of dynamics has been applied in evolutionary economics, for modeling human and economic behavior [1, 2, 3, 4, 5], and in sociology, for modeling social network formation [6, 7].

Reinforcement  learning  schemes  are  also  related  to  replicator  dynamics  [8]  as  has  been  pointed  out  by 
several  authors  [2,  4,  5].  For  example,  in  [2,  9],  the  asymmetric,  continuous  replicator  dynamics  (cf.,  [10]) 
have  been  identified  as  the  continuous  time  limit  of  a  reinforcement  learning  scheme  which  is  based  on 
Bush-Mosteller’s  [11]  simple  learning  model. 

One  of  the  main  concerns  in  the  analysis  of  reinforcement  learning  schemes  is  showing  nonconvergence 
to  boundary  points  of  the  probability  simplex  which  do  not  correspond  to  Nash  equilibria.  In  fact,  as  pointed 
out  in  [4],  establishing  nonconvergence  to  the  boundary  of  the  probability  simplex  might  not  be  trivial, 
since  standard  results  of  the  ODE  method  for  stochastic  approximations  (e.g.,  nonconvergence  to  unstable 
equilibria  [12])  are  not  applicable.  Thus,  the  behavior  of  several  reinforcement  learning  models,  e.g.,  the 
model  by  [1],  cannot  be  directly  related  to  (standard)  replicator  dynamics.  This  is  mainly  due  to  the  fact  that 
several  models  of  reinforcement  learning  may  converge  to  saddle  boundary  points  of  the  replicator  dynamics 
[4]. 

In this paper, we sidestep these issues by introducing a new class of reinforcement learning schemes where the strategies of each agent are perturbed by a state-dependent perturbation function. Contrary to prior work on equilibrium selection where perturbation functions are also state dependent [13], the perturbation function here is assumed to be local, i.e., it only depends on the strategy of each player. Due to this perturbation function, the ODE method for stochastic approximations can be applied, since boundary points of the domain cease to be stationary points of the relevant ODE. This paper extends prior work [14] of the authors, where the perturbation function was assumed constant independently of the strategy. In particular, we provide conditions under which the strategies of the agents will converge to an arbitrarily small neighborhood of the set of Nash equilibria almost surely.

We further specialize the results to a class of games which belongs to the family of potential games [15]. It includes common-payoff (or identical-interest) games, congestion games [16], and two-player rescaled partnership games [10]. Potential games are also of particular interest in engineering, for example in congestion control [16], distributed spatial coverage [17] and distributed routing [18]. In these examples, and when agents are playing the game repeatedly, learning to play a Nash equilibrium is of special interest, especially when the information available to each agent is only the history of its own utilities and its own actions. We provide conditions under which the proposed reinforcement learning scheme converges to the set of pure Nash equilibria for this class of games. This is also an extension of prior work on reinforcement learning [5, 4] in potential games, which has primarily focused on the urn process of [3].

The remainder of the paper is organized as follows. Section 2 introduces the necessary terminology. Section 3 introduces the perturbed reinforcement learning scheme with a state-based perturbation function. Section 4 states some standard results from Lyapunov-based techniques and the ODE method for analyzing stochastic approximations. Section 5 characterizes the set of stationary points of the reinforcement learning scheme for both the unperturbed and the perturbed dynamics. Section 6 analyzes the behavior of the unperturbed reinforcement learning scheme close to the boundary points of the domain, while Section 7 analyzes the convergence properties of the perturbed learning scheme. Finally, Section 8 specializes the results to a class of games which belongs to the family of potential games, and Section 9 presents concluding remarks.

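Notation:

- $|x|$ denotes the Euclidean norm of a vector $x \in \mathbb{R}^n$.

- $|x|_\infty$ denotes the $\ell_\infty$-norm of a vector $x \in \mathbb{R}^n$.

- $B_\delta(x)$ denotes the $\delta$-neighborhood of a vector $x \in \mathbb{R}^n$, i.e.,

$$B_\delta(x) \triangleq \{y \in \mathbb{R}^n : |x - y| < \delta\}.$$

- $\mathrm{dist}(x, A)$, the distance from a point $x$ to a set $A$, is defined as

$$\mathrm{dist}(x, A) = \inf_{y \in A} |x - y|.$$

- $B_\delta(A)$ denotes the $\delta$-neighborhood of a set $A \subset \mathbb{R}^n$, i.e.,

$$B_\delta(A) = \{x : \mathrm{dist}(x, A) < \delta\}.$$

- $\Delta(m)$ denotes the probability simplex of dimension $m$, i.e.,

$$\Delta(m) = \{x \in \mathbb{R}^m : x \geq 0, \ \mathbf{1}^T x = 1\},$$

where $\mathbf{1}$ is the vector of ones of appropriate dimension.

- $\Pi_\Delta : \mathbb{R}^m \to \Delta(m)$ is the projection to the probability simplex, i.e.,

$$\Pi_\Delta[x] = \arg\min_{y \in \Delta(m)} |x - y|.$$

- $A^o$ is the interior of a subset $A$ of $\mathbb{R}^n$, and $\partial A$ is its boundary.

- $\mathrm{row}\{\alpha_i\}_{i \in J}$ denotes the block row vector with entries $\{\alpha_i\}_{i \in J}$ for some set of indices $J$, i.e.,

$$\mathrm{row}\{\alpha_i\}_{i \in J} = ( \alpha_1 \ \cdots \ \alpha_{|J|} ),$$

where $\alpha_i \in \mathbb{R}^{1 \times n_i}$ for some $n_i \in \mathbb{N}$, $i \in J$. Likewise, $\mathrm{col}\{\cdot\}$ will denote a block column vector.

- $\mathrm{diag}\{A_i\}_{i \in J}$ denotes the block diagonal matrix with diagonal entries $\{A_i\}_{i \in J}$ for some set of indices $J$, i.e.,

$$\mathrm{diag}\{A_i\}_{i \in J} = \begin{pmatrix} A_1 & & \\ & \ddots & \\ & & A_{|J|} \end{pmatrix},$$

where $A_i \in \mathbb{R}^{n_i \times m_i}$ for some $n_i, m_i \in \mathbb{N}$, $i \in J$.
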


2  Terminology 

We  consider  the  standard  setup  of  finite  strategic-form  games. 

2.1  Game 

A finite strategic-form game involves a finite set of agents (or players), $\mathcal{I} = \{1, 2, ..., n\}$. Each agent $i \in \mathcal{I}$ has a finite set of available choices (or actions), $\mathcal{A}_i$. Let $\alpha_i \in \mathcal{A}_i$ denote an action of agent $i$, and $\alpha = (\alpha_1, \alpha_2, ..., \alpha_n)$ the action profile of all agents. The set $\mathcal{A}$ is the Cartesian product of the action spaces of all agents, i.e., $\mathcal{A} = \mathcal{A}_1 \times ... \times \mathcal{A}_n$.

The action profile $\alpha \in \mathcal{A}$ produces a payoff (or utility) for each agent. The utility of agent $i$, denoted by $R_i$, is a function which maps the action profile $\alpha$ to a payoff in $\mathbb{R}$. It constitutes a measure of the desirability of the action profile $\alpha$, where a high-payoff action profile is more desirable than a low-payoff action profile. Let us also denote by $R : \mathcal{A} \to \mathbb{R}^n$ the combination of payoffs (or payoff profile) of all agents, i.e., $R(\cdot) = (R_1(\cdot), R_2(\cdot), ..., R_n(\cdot))$. A strategic-form game will then be completely characterized by the triple $\{\mathcal{I}, \mathcal{A}, R\}$.

2.2  Strategy 

Since each agent selects actions independently, we generally assume that each agent's action is a realization of an independent discrete random variable. Let $\sigma_{ij} \in [0, 1]$ denote the probability that agent $i$ selects action $\alpha_i = j \in \mathcal{A}_i$. If $\sum_{j \in \mathcal{A}_i} \sigma_{ij} = 1$, then $\sigma_i = (\sigma_{i1}, \sigma_{i2}, ..., \sigma_{i|\mathcal{A}_i|})$ is a probability distribution over the set of actions $\mathcal{A}_i$ (or strategy of agent $i$), where $|\mathcal{A}_i|$ denotes the cardinality of the set $\mathcal{A}_i$. Then $\sigma_i \in \Delta(|\mathcal{A}_i|)$. We will also use the term strategy profile to denote the combination of strategies of all agents, $\sigma = (\sigma_1, \sigma_2, ..., \sigma_n) \in \boldsymbol{\Delta}$, where $\boldsymbol{\Delta} = \Delta(|\mathcal{A}_1|) \times ... \times \Delta(|\mathcal{A}_n|)$ is the set of strategy profiles.

Note that if $\sigma_i$ is a unit vector (or a vertex of $\Delta(|\mathcal{A}_i|)$), say $e_j$, then agent $i$ selects action $j$ with probability one. Such a strategy will be called a pure strategy. Likewise, a pure strategy profile is a profile of pure strategies. We will also use the term mixed strategy to denote a strategy that is not pure.
2.3 Expected payoff and Nash equilibrium

Given a strategy profile $\sigma \in \boldsymbol{\Delta}$, the expected payoff vector of each agent $i$, $U_i : \boldsymbol{\Delta} \to \mathbb{R}^{|\mathcal{A}_i|}$, can be computed by

$$U_i(\sigma) = \sum_{j \in \mathcal{A}_i} e_j \sum_{\alpha_{-i} \in \mathcal{A}_{-i}} \Big( \prod_{s \in -i} \sigma_{s\alpha_s} \Big) R_i(j, \alpha_{-i}). \tag{1}$$

We may think of the entry $j$ of the expected payoff vector, denoted $U_{ij}(\sigma)$, as the payoff of agent $i$ who is playing action $j$ at strategy profile $\sigma$. We denote the profile of expected payoffs by $U(\sigma) = (U_1(\sigma), ..., U_n(\sigma))$. Finally, let $u_i(\sigma)$ be the expected payoff of agent $i$ at strategy profile $\sigma \in \boldsymbol{\Delta}$, defined as follows:

$$u_i(\sigma) = \sigma_i^T U_i(\sigma).$$

In the trivial case of $n = 2$, it is straightforward to check that for every $i \in \mathcal{I}$, there exists a matrix $D_i \in \mathbb{R}^{|\mathcal{A}_i| \times |\mathcal{A}_{-i}|}$ such that $D_i = [R_i(j, \ell)]_{j \in \mathcal{A}_i, \ell \in \mathcal{A}_{-i}}$. In this case, the expected payoff of player $i$ can be written in the simplified form:

$$u_i(\sigma) = \sigma_i^T D_i \sigma_{-i}.$$

Definition 2.1 (Nash equilibrium) A strategy profile $\sigma^* = (\sigma_1^*, \sigma_2^*, ..., \sigma_n^*) \in \boldsymbol{\Delta}$ is a Nash equilibrium if, for each agent $i \in \mathcal{I}$,

$$u_i(\sigma_i^*, \sigma_{-i}^*) \geq u_i(\sigma_i, \sigma_{-i}^*) \tag{2}$$

for all $\sigma_i \in \Delta(|\mathcal{A}_i|)$ with $\sigma_i \neq \sigma_i^*$, where $\sigma_{-i}^*$ denotes the equilibrium strategy profile of all agents but $i$.

In the special case where $\sigma_i^*$ is a pure strategy for all $i \in \mathcal{I}$, the Nash equilibrium is called a pure Nash equilibrium. Also, in case the inequality in (2) is strict, the Nash equilibrium will be called a strict Nash equilibrium.
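
As an illustration of the two-player form $u_i(\sigma) = \sigma_i^T D_i \sigma_{-i}$ and of inequality (2), the following minimal sketch checks whether a strategy profile is a Nash equilibrium. The function names and the payoff matrix are hypothetical and chosen only for this example. Each matrix is indexed as `D[own action][other agent's action]`, and the check compares against pure deviations only, which suffices because the expected payoff is linear in an agent's own strategy.

```python
def expected_payoff_vector(D, sigma_other):
    """U_i(sigma): entry j is the expected payoff of action j against sigma_{-i}."""
    return [sum(d * s for d, s in zip(row, sigma_other)) for row in D]


def expected_payoff(sigma_own, D, sigma_other):
    """u_i(sigma) = sigma_i^T D_i sigma_{-i}."""
    U = expected_payoff_vector(D, sigma_other)
    return sum(p * u for p, u in zip(sigma_own, U))


def is_nash(sigma1, sigma2, D1, D2, tol=1e-9):
    """Check inequality (2) for both agents against every pure deviation."""
    U1 = expected_payoff_vector(D1, sigma2)
    U2 = expected_payoff_vector(D2, sigma1)
    u1 = expected_payoff(sigma1, D1, sigma2)
    u2 = expected_payoff(sigma2, D2, sigma1)
    return max(U1) <= u1 + tol and max(U2) <= u2 + tol


# Hypothetical common-payoff coordination game with strictly positive rewards.
D = [[2.0, 0.5],
     [0.5, 1.0]]

print(is_nash([1.0, 0.0], [1.0, 0.0], D, D))      # both agents play action 1
print(is_nash([1.0, 0.0], [0.0, 1.0], D, D))      # miscoordinated profile
print(is_nash([0.25, 0.75], [0.25, 0.75], D, D))  # mixed profile
```

In this example the two coordinated pure profiles and the mixed profile pass the check, while the miscoordinated vertex does not, although it is still a vertex of the strategy space.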

3  Perturbed  Learning  Automata 

In  this  section,  we  introduce  the  basic  form  of  reinforcement  learning  that  we  will  consider  in  the  remainder 
of  the  paper.  It  belongs  to  the  general  class  of  learning  automata  [19]. 

The basic idea behind a reinforcement learning scheme is a rather simple one. If agent $i$ selects action $j$ at instant $k$ and a favorable payoff results, $R_i(\alpha(k))$, the action probability $\sigma_{ij}(k)$ is increased and all the other components of $\sigma_i(k)$ are decreased. For an unfavorable payoff, $\sigma_{ij}(k)$ is decreased and all the other components of $\sigma_i(k)$ are increased.

The precise manner in which $\sigma_i(k)$ is changed, depending on the action $\alpha_i$ performed at stage $k$ and the response $R_i(\alpha(k))$ of the environment, completely defines the reinforcement scheme. This, in turn, determines the resulting Markov process and hence the behavior of the overall system.

For the remainder of the paper, we will assume:

Assumption 3.1 (Strictly positive rewards) For every $i \in \mathcal{I}$, the reward function satisfies $R_i(\alpha) > 0$ for all $\alpha \in \mathcal{A}$.

1 The notation $-i$ denotes the complementary set $\mathcal{I} \setminus \{i\}$. We will often split the argument of a function in this way, e.g., $F(\alpha) = F(\alpha_i, \alpha_{-i})$ or $F(x) = F(x_i, x_{-i})$.

3.1 Modified Linear Reward-Inaction ($\tilde{L}_{R-I}$) scheme

We consider a reinforcement scheme which is a small modification of the original linear reward-inaction scheme ($L_{R-I}$) introduced by [20, 21]. This modified scheme, denoted by $\tilde{L}_{R-I}$, was introduced in [14]. Compared with $L_{R-I}$, the reward in $\tilde{L}_{R-I}$ may take values other than $\{0, 1\}$, which enlarges the family of games to which this learning scheme can be applied.

Similarly to $L_{R-I}$, the probability that agent $i$ selects action $j$ at time $k$ is

$$\sigma_{ij}(k) = x_{ij}(k)$$

for some probability vector $x_i(k)$ which is updated according to the recursion:

$$x_i(k+1) = \Pi_\Delta\big[ x_i(k) + \epsilon(k) \cdot R_i(\alpha(k)) \cdot [e_{\alpha_i(k)} - x_i(k)] \big]. \tag{3}$$

Here we identify the actions $\mathcal{A}_i$ with the vertices of the simplex, $\{e_1, ..., e_{|\mathcal{A}_i|}\}$. For example, if agent $i$ selects action $j$ at time $k$, then $e_{\alpha_i(k)} \equiv e_j$. Note that, by letting the step-size sequence $\epsilon(k)$ be sufficiently small, and since the payoff function $R_i(\cdot)$ is uniformly bounded in $\mathcal{A}$, $x_i(k) \in \Delta(|\mathcal{A}_i|)$ and the projection operator can be omitted.
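
The recursion (3) can be sketched in a few lines. The code below is only an illustration under stated assumptions: two agents, a made-up payoff matrix with rewards strictly between 0 and 1 (so that the product of step size and reward never exceeds one and the projection can be dropped), and a step size of the form $1/(k^\nu + 1)$. All function names are hypothetical.

```python
import random


def lri_step(x, action, reward, eps):
    """One update of recursion (3) without projection:
    x + eps * reward * (e_action - x)."""
    return [xj + eps * reward * ((1.0 if j == action else 0.0) - xj)
            for j, xj in enumerate(x)]


def step_size(k, nu=0.75):
    """Decreasing step size 1 / (k^nu + 1), with nu in (1/2, 1]."""
    return 1.0 / (k ** nu + 1.0)


def simulate(R1, R2, steps=2000, nu=0.75, seed=0):
    """Run two agents; each matrix is indexed [own action][other's action]."""
    rng = random.Random(seed)
    x1 = [1.0 / len(R1)] * len(R1)
    x2 = [1.0 / len(R2)] * len(R2)
    for k in range(steps):
        # Each agent samples an action from its own current strategy.
        a1 = rng.choices(range(len(x1)), weights=x1)[0]
        a2 = rng.choices(range(len(x2)), weights=x2)[0]
        eps = step_size(k, nu)
        # Each agent only observes its own action and its own reward.
        x1 = lri_step(x1, a1, R1[a1][a2], eps)
        x2 = lri_step(x2, a2, R2[a2][a1], eps)
    return x1, x2


# Hypothetical common-payoff game with rewards in (0, 1).
R = [[0.9, 0.2],
     [0.2, 0.6]]
x1, x2 = simulate(R, R)
print(x1, x2)
```

Because each update is a convex combination of the current strategy and a vertex, the iterates remain on the simplex in this sketch; which vertex they approach depends on the random seed.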

We consider the following class of step-size sequences:

$$\epsilon(k) = \frac{1}{k^\nu + 1} \tag{4}$$

for some $\nu \in (1/2, 1]$. For these values of $\nu$, the following two conditions can easily be verified:

$$\sum_{k=0}^{\infty} \epsilon(k) = \infty \quad \text{and} \quad \sum_{k=0}^{\infty} \epsilon(k)^2 < \infty. \tag{5}$$

The selection of $\nu$ is closely related to the desired rate of convergence.
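
As a brief check of the two conditions in (5), a comparison with p-series suffices. The bounds below are a worked step added for illustration, assuming the step size $\epsilon(k) = 1/(k^\nu + 1)$ of (4) with $\nu \in (1/2, 1]$; the sums start at $k = 1$ since the $k = 0$ term is finite and affects neither conclusion.

```latex
% For k >= 1 and 1/2 < \nu <= 1 we have k^\nu + 1 <= 2 k^\nu <= 2k, hence
\sum_{k=1}^{\infty} \epsilon(k)
  = \sum_{k=1}^{\infty} \frac{1}{k^{\nu} + 1}
  \;\geq\; \frac{1}{2} \sum_{k=1}^{\infty} \frac{1}{k} = \infty,
% while \epsilon(k) <= k^{-\nu} and 2\nu > 1 give
\sum_{k=1}^{\infty} \epsilon(k)^2
  \;\leq\; \sum_{k=1}^{\infty} \frac{1}{k^{2\nu}} < \infty .
```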

Compared with prior reinforcement schemes, in particular the models of [1, 4], the main difference lies in the step-size sequence. More specifically, in [1] the step-size sequence of agent $i$ is $\epsilon_i(k) = 1/(ck^\nu + R_i(\alpha))$ for some positive constant $c$ and for $0 < \nu < 1$. A comparable model is also used by [4], with step-size sequence $\epsilon_i(k) = 1/(V_i(k) + R_i(\alpha(k)))$, where $V_i(k)$ is the accumulated benefits of agent $i$ up to time $k$, which gives rise to an urn process introduced by [3]. Some similarities are also shared with Cross' learning model of [2], where $\epsilon(k) = 1$ and $R_i(\alpha(k)) < 1$, and its modification presented in [9], where $\epsilon(k)$ is instead assumed decreasing. The aforementioned reinforcement schemes do not have identical convergence properties with $\tilde{L}_{R-I}$. Their differences will be discussed in detail throughout the paper.
3.2 Perturbed Linear Reward-Inaction Scheme ($\tilde{L}^{\lambda}_{R-I}$)

Here we consider a perturbed version of the $\tilde{L}_{R-I}$ scheme, in the same spirit as [14], where the decisions of each agent are slightly perturbed. In particular, we assume that each agent $i$ selects action $j \in \mathcal{A}_i$ according to the perturbed strategy

$$\sigma_{ij} = (1 - \zeta_i(x_i, \lambda)) x_{ij} + \zeta_i(x_i, \lambda) / |\mathcal{A}_i|, \tag{6}$$

for some perturbation function $\zeta_i : \Delta(|\mathcal{A}_i|) \times [0, 1] \to [0, 1]$ (usually called mutations).

The introduction of a state-dependent mutations function is intended to explore how the local information of each agent $i$, namely its state $x_i$, can alter the convergence properties of the state of the group. Here, we investigate one class of such mutations functions. In particular, we would like $\zeta_i(x_i, \lambda)$ to exhibit larger values when the strategy $x_i$ is close to a vertex of the probability simplex, i.e., when the strategy is close to a pure strategy. Informally, players are "exploring" the most when they are "certain" about which action to choose.

Formally,  we  will  consider  the  following  class  of  perturbation  functions: 

Assumption 3.2 (Perturbation function) The perturbation function $\zeta_i$ is continuously differentiable. Furthermore, for some $\beta \in (0, 1)$ sufficiently close to one, $\zeta_i$ satisfies the following properties:

1. $\zeta_i(x_i, \lambda) = 0$ for all $x_i$ such that $|x_i|_\infty < \beta$, for any $\lambda \geq 0$;

2. $\lim_{|x_i|_\infty \to 1} \zeta_i(x_i, \lambda) = \lambda$;

3. $\lim_{|x_i|_\infty \to 1} \frac{\partial \zeta_i(x_i, \lambda)}{\partial \lambda} \Big|_{\lambda = 0} = c$ for some $c > 0$;

4. $\lim_{|x_i|_\infty \to 1} \frac{\partial \zeta_i(x_i, \lambda)}{\partial x_{ij}} \Big|_{\lambda = 0} = 0$ for any $j \in \mathcal{A}_i$.

In other words, we would like the perturbation function of agent $i$ (1) to be zero when its strategy is not close to a vertex of $\Delta(|\mathcal{A}_i|)$; and (2) to be equal to $\lambda$ when its strategy is at a vertex of $\Delta(|\mathcal{A}_i|)$. Properties (3) and (4) are necessary in order to analyze the behavior of the stochastic process in the vicinity of the vertices of $\Delta(|\mathcal{A}_i|)$. In particular, property (3) states that the perturbation increases with $\lambda$, when evaluated at a vertex of the probability simplex and for $\lambda = 0$. As we shall see in a forthcoming section, due to this property, vertices cease to be stationary points of the mean-field dynamics introduced in Section 4, which has favorable implications for the asymptotic behavior of the learning dynamics. Finally, property (4) states that the perturbation does not change with $x_i$ when evaluated at a vertex of the probability simplex and for $\lambda = 0$. Together with property (1), property (4) establishes equivalence between the perturbed and unperturbed dynamics when $\lambda = 0$.

For example, a candidate perturbation function is:

$$\zeta_i(x_i, \lambda) = \begin{cases} 0, & |x_i|_\infty < \beta, \\ \frac{\lambda}{(1 - \beta)^2} (|x_i|_\infty - \beta)^2, & |x_i|_\infty \geq \beta. \end{cases} \tag{7}$$

It is straightforward to check that this function satisfies the properties of Assumption 3.2 when we select $\beta \in (0, 1)$ sufficiently close to one. Figure 1 plots the candidate perturbation function (7) about one of the vertices of the domain $\Delta(|\mathcal{A}_i|)$.

Figure 1: Candidate perturbation function (7) (horizontal axis: $x_{ij}$, from 0 to 1).

Note that the main difference from the scheme previously introduced in [14] is that here we allow the perturbation function of each agent to also depend on the agent's own strategy. Instead, in [14], the perturbation function was assumed to be a constant $\lambda > 0$ for all agents $i \in \mathcal{I}$ and for all strategy vectors in $\Delta(|\mathcal{A}_i|)$.

We will denote this scheme by $\tilde{L}^{\lambda}_{R-I}$.
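
To make the perturbed strategy (6) concrete, the sketch below uses a quadratic ramp $\lambda ((|x_i|_\infty - \beta)/(1 - \beta))^2$ above the threshold $\beta$. This particular form is an assumption made for illustration; it is one function that is zero below $\beta$ and equals $\lambda$ at a vertex. The names `zeta` and `perturbed_strategy` and the default `beta=0.9` are hypothetical.

```python
def zeta(x, lam, beta=0.9):
    """Assumed candidate perturbation: zero while max(x) < beta,
    then a quadratic ramp that reaches lam at a vertex of the simplex."""
    m = max(x)  # infinity norm of a probability vector
    if m < beta:
        return 0.0
    return lam * ((m - beta) / (1.0 - beta)) ** 2


def perturbed_strategy(x, lam, beta=0.9):
    """Equation (6): mix x with the uniform distribution, with weight zeta."""
    z = zeta(x, lam, beta)
    n = len(x)
    return [(1.0 - z) * xj + z / n for xj in x]


# Away from the vertices the strategy is left unchanged.
print(perturbed_strategy([0.6, 0.4], lam=0.1))
# At a vertex, a fraction lam of the probability mass is spread uniformly.
print(perturbed_strategy([1.0, 0.0], lam=0.1))
```

The perturbation only acts near pure strategies, so under this sketch an agent at a vertex still selects every action with probability at least $\lambda / |\mathcal{A}_i|$.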

3.3  Discussion 

Similar ideas of state-dependent mutations have been explored in aspiration learning [22] and in adaptive learning [13]. In both references, the intention of a state-dependent excitation of the dynamics is to establish an equilibrium selection scheme that will give rise to more desirable outcomes. For example, in [22], the intention is to show that the aspiration learning scheme will converge to a Pareto efficient action profile. In a similar spirit, reference [13] introduces globally state-dependent mutations to show that each action profile can be a stochastically stable outcome of an evolutionary learning process, when the mutations function is tailored appropriately.

Our intention here is to also use the state-dependent perturbation function as an equilibrium selection mechanism. In comparison with [22], our goal is to analyze the asymptotic behavior of a class of reinforcement learning schemes, whose behavior is quite different from that of aspiration learning. In comparison with [13], the perturbation function of each agent $i$ is also state dependent; however, it only depends on the strategy of agent $i$ and not on the strategy profile of all agents.

Furthermore, the introduction of such a perturbation function serves as an alternative scheme for analyzing convergence to boundary points of the probability simplex, compared to the prior analysis in both [1] and [4]. In particular, as [4] points out, the behavior of general models of reinforcement learning, such as the model by [1], cannot be directly related to standard replicator dynamics (cf., [10, Chapter 7]). This is mainly due to the fact that several models of reinforcement learning may converge to saddle points of the standard replicator dynamics. As will become clear later on, such issues are sidestepped here due to the introduction of the mutations function of $\tilde{L}^{\lambda}_{R-I}$.




4  Background  Convergence  Analysis 

Let $\Omega = \boldsymbol{\Delta}^\infty$ denote the canonical path space with an element $\omega$ being a sequence $\{x(0), x(1), ...\}$, where $x(k) = (x_1(k), ..., x_n(k)) \in \boldsymbol{\Delta}$ is generated by the reinforcement learning process. An example of a random variable defined in $\Omega$ is the function $\psi_k : \Omega \to \boldsymbol{\Delta}$ such that $\psi_k(\omega) = x(k)$. Another example of a random variable that we will also use is $\psi_k(\omega) = \alpha(k)$. In several cases, we will abuse notation by simply writing $x(k)$ or $\alpha(k)$ instead of $\psi_k(\omega)$. Let also $\mathcal{F}$ be a $\sigma$-algebra of subsets of $\Omega$, and let $P$, $E$ be the probability and expectation operators on $(\Omega, \mathcal{F})$, respectively. In the following analysis, we implicitly assume that the $\sigma$-algebra $\mathcal{F}$ is generated appropriately to allow computation of the probabilities or expectations of interest.

To analyze the asymptotic behavior of the reinforcement learning schemes, we will use a) stochastic Lyapunov stability analysis, in order to investigate the probability that a sample function exits from a domain, and b) the ODE method for stochastic approximations, in order to investigate the probability of convergence to invariant sets of the mean-field dynamics. The background results necessary for the analysis are presented in the following subsections.

4.1  Exit  of  a  sample  function  from  a  domain 

It is important to have conditions under which the process $\psi_k(\omega) = x(k)$, $k \geq 0$, with some initial distribution, will exit an open domain $G$ in finite time.

Proposition 4.1 (Theorem 5.1 in [23]) Suppose that there exists a nonnegative function $V(k, x)$ in the domain $k \geq 0$, $x \in G$, such that

$$LV(k, x) = E[V(k+1, x(k+1)) - V(k, x(k)) \mid x(k) = x]$$

satisfies $LV(k, x) \leq -a(k)$ in this domain, where $a(k)$ is a sequence such that

$$a(k) > 0, \quad \sum_{k=0}^{\infty} a(k) = \infty. \tag{8}$$

Then, the process $x(k)$ leaves $G$ in a finite time with probability 1.

The following corollary is important in cases where we would like to consider entrance of a stochastic process into the domain of attraction of an equilibrium. It is a direct consequence of Proposition 4.1. For details, see Exercise 5.1 in [23].

Corollary 4.1 Let $A \subset \boldsymbol{\Delta}$, $B_\delta(A)$ its $\delta$-neighborhood, and $D_\delta(A) = \boldsymbol{\Delta} \setminus B_\delta(A)$. Suppose there exists a nonnegative function $V(k, x)$ in the domain $k \geq 0$, $x \in \boldsymbol{\Delta}$ for which

$$LV(k, x) \leq -a(k) \varphi(k, x), \quad k \geq 0, \ x \in \boldsymbol{\Delta}, \tag{9}$$

where the sequence $a(k)$ satisfies (8) and $\varphi(k, x)$ satisfies

$$\inf_{k \geq T, \, x \in D_\delta(A)} \varphi(k, x) > 0$$

for all $\delta > 0$ and some $T = T(\delta)$. Then

$$P\Big[ \liminf_{k \to \infty} \mathrm{dist}(x(k), A) = 0 \Big] = 1.$$

Corollary 4.1 implies that the stochastic process enters an arbitrarily small neighborhood of a set $A$ infinitely often with probability one.

4.2  Convergence  to  mean-field  dynamics 

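The convergence properties of $\tilde{L}^{\lambda}_{R-I}$ can be described via the ODE method for stochastic approximations. In particular, the recursion of $\tilde{L}^{\lambda}_{R-I}$, $\lambda \geq 0$, can be written in the following form:

$$x_i(k+1) = x_i(k) + \epsilon(k) \cdot [\bar{g}_i^\lambda(x(k)) + \xi_i^\lambda(k)], \tag{10}$$

where the observation sequence has been decomposed into a deterministic sequence, $\bar{g}_i^\lambda(x(k))$ (or mean-field), and a noise sequence, $\xi_i^\lambda(k)$. The mean-field is defined as follows:

$$\bar{g}_i^\lambda(x) \triangleq E[ R_i(\alpha(k)) [e_{\alpha_i(k)} - x_i(k)] \mid x(k) = x ]$$

such that its $s$-th entry is

$$\bar{g}_{is}^\lambda(x) \triangleq U_{is}(x) \sigma_{is} - \sum_{q \in \mathcal{A}_i} U_{iq}(x) \sigma_{iq} x_{is}.$$

It is straightforward to verify that $\bar{g}^\lambda(\cdot)$ is continuously differentiable due to the definition of the perturbation function $\zeta_i$. The noise sequence is defined as

$$\xi_i^\lambda(k) \triangleq R_i(\alpha(k)) \cdot [e_{\alpha_i(k)} - x_i(k)] - \bar{g}_i^\lambda(x(k)),$$

where $E[\xi_i^\lambda(k) \mid x(k) = x] = 0$ for all $x \in \boldsymbol{\Delta}$.

Note that for $\lambda = 0$, (10) coincides with $\tilde{L}_{R-I}$. We will denote by $\bar{g}(x)$ the corresponding vector field.

The following more compact form of (10) will also be used:

$$x(k+1) = x(k) + \epsilon(k) \cdot [\bar{g}^\lambda(x(k)) + \xi^\lambda(k)], \tag{11}$$

where $\bar{g}^\lambda(\cdot) = \mathrm{col}\{\bar{g}_i^\lambda(\cdot)\}_{i \in \mathcal{I}}$ and $\xi^\lambda(\cdot) = \mathrm{col}\{\xi_i^\lambda(\cdot)\}_{i \in \mathcal{I}}$.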
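
The entry-wise expression of the mean-field can be checked numerically against its definition as a conditional expectation. The sketch below does this for agent 1 in a two-player game by enumerating all joint actions. It reads $U_{is}$ as the expected payoff of action $s$ against the other agent's strategy actually played, which is an interpretation made for this illustration; the function names and the payoff matrix are hypothetical.

```python
def mean_field_by_enumeration(x1, s1, s2, D1):
    """E[R_1(alpha) (e_{alpha_1} - x_1)] with actions drawn from s1 and s2."""
    n1, n2 = len(s1), len(s2)
    g = [0.0] * n1
    for a1 in range(n1):
        for a2 in range(n2):
            p = s1[a1] * s2[a2]  # probability of the joint action (a1, a2)
            for s in range(n1):
                indicator = 1.0 if s == a1 else 0.0
                g[s] += p * D1[a1][a2] * (indicator - x1[s])
    return g


def mean_field_closed_form(x1, s1, s2, D1):
    """Entry s: U_1s * s1[s] - (sum_q U_1q * s1[q]) * x1[s]."""
    U = [sum(D1[j][l] * s2[l] for l in range(len(s2))) for j in range(len(s1))]
    avg = sum(U[q] * s1[q] for q in range(len(s1)))
    return [U[s] * s1[s] - avg * x1[s] for s in range(len(s1))]


D1 = [[0.9, 0.2],
      [0.2, 0.6]]
x1 = [0.7, 0.3]
s1 = [0.68, 0.32]  # some perturbed version of x1
s2 = [0.4, 0.6]
print(mean_field_by_enumeration(x1, s1, s2, D1))
print(mean_field_closed_form(x1, s1, s2, D1))
# With no perturbation (s1 equal to x1), a vertex is a stationary point.
print(mean_field_closed_form([1.0, 0.0], [1.0, 0.0], s2, D1))
```

The first two printed vectors agree up to rounding, and the last one is the zero vector, which illustrates why every vertex is stationary for the unperturbed dynamics regardless of whether it is a Nash equilibrium.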

Proposition 4.2 (Convergence) For the reinforcement scheme $\tilde{L}^{\lambda}_{R-I}$, $\lambda \geq 0$, the stochastic iteration (11) is such that, for almost all $\omega \in \Omega$, $\{\psi_k(\omega) = x(k)\}$ converges to some bounded invariant set of the ODE:

$$\dot{x} = \bar{g}^\lambda(x). \tag{12}$$

Furthermore, if $A \subset \boldsymbol{\Delta}$ is a locally asymptotically stable set in the sense of Lyapunov for (12),$^2$ and $x(k)$ is in some compact set in the domain of attraction of $A$ infinitely often with probability $\geq p$, then $x(k) \to A$ with at least probability $p$.

Proof. The proposition follows from Theorem 6.6.1 of [24] since the following conditions are satisfied:

- The function $\bar{g}^\lambda(\cdot)$ is continuous.

- The sequence $Y(k) = \bar{g}^\lambda(x(k)) + \xi^\lambda(k)$ satisfies $\sup_k E[|Y(k)|^2] < \infty$ since, by Assumption 3.1, the reward function is positive and bounded from above.

- The step-size sequence satisfies $\sum_k \epsilon(k)^2 < \infty$ and $\sum_k \epsilon(k) = \infty$.

□


5  Stationary  Points 

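Stationary points of the mean-field dynamics are defined as the set of points $x \in \Delta$ for which $g^{\lambda}(x) = 0$. In the following subsections, we characterize the set of stationary points for both the unperturbed ($\lambda = 0$) and the perturbed dynamics ($\lambda > 0$).

We will make the following distinction among stationary points of (12) for $\lambda > 0$, denoted $\mathcal{S}^{\lambda}$:

- $\mathcal{S}^{\lambda}_{\partial\Delta}$: stationary points in $\partial\Delta$;

- $\mathcal{S}^{\lambda}_{\Delta^*}$: stationary points which are vertices of $\Delta$;

- $\mathcal{S}^{\lambda}_{\Delta^{\circ}}$: stationary points in $\Delta^{\circ}$;

- $\mathcal{S}^{\lambda}_{NE}$: stationary points which are Nash equilibria.

We will also use the notation $\mathcal{S}_{\partial\Delta}$, $\mathcal{S}_{\Delta^*}$, $\mathcal{S}_{\Delta^{\circ}}$, and $\mathcal{S}_{NE}$ to denote the corresponding sets when $\lambda = 0$.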

5.1  Stationary Points of Unperturbed Dynamics ($\lambda = 0$)

Before describing the stationary points of the mean-field dynamics (12) under the unperturbed reinforcement learning ($\lambda = 0$), it is important to point out that the corresponding mean-field of the share of strategy $s$ in agent $i$ when $\lambda = 0$ can be written as:

$$g_{is}(x) = \Big(U_{is}(x) - \sum_{q \in \mathcal{A}_i} U_{iq}(x)\,x_{iq}\Big)\,x_{is}, \tag{13}$$

which coincides with the corresponding shares provided by the standard replicator dynamics, as pointed out by [2, Proposition 1]. This form of dynamics can also be thought of as a special class of imitative dynamics, as discussed in [25, Section 5.4].

$^2$ If $\{x(t) : t \geq 0\}$ denotes the solution of the ODE (12), then a set $A \subset \Delta$ is a locally asymptotically stable set in the sense of Lyapunov for the ODE (12) if a) for each $\varepsilon > 0$, there exists $\delta = \delta(\varepsilon) > 0$ such that $\mathrm{dist}(x(0), A) < \delta$ implies $\mathrm{dist}(x(t), A) < \varepsilon$ for all $t \geq 0$, and b) there exists $\delta > 0$ such that $\mathrm{dist}(x(0), A) < \delta$ implies $\lim_{t\to\infty} \mathrm{dist}(x(t), A) = 0$.

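The following more compact form of standard replicator dynamics will be more convenient:

$$g_i(x) = X_i(x_i) \cdot U_i(x), \quad i \in \mathcal{I}, \tag{14}$$

where $X_i : \Delta(|\mathcal{A}_i|) \to \mathbb{R}^{|\mathcal{A}_i| \times |\mathcal{A}_i|}$ is such that $[X_i(x_i)]_{jj} = x_{ij}(1 - x_{ij})$ for any $j \in \mathcal{A}_i$ and $[X_i(x_i)]_{jk} = -x_{ij}x_{ik}$ for any $j, k \in \mathcal{A}_i$, with $j \neq k$.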
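The equivalence between the matrix form (14) and the entrywise replicator form (13) can be checked numerically. The sketch below uses hypothetical values for `x_i` and `U_i`; the helper names are illustrative and do not come from the paper.

```python
def X_matrix(x_i):
    """[X]_jj = x_j * (1 - x_j), [X]_jk = -x_j * x_k for j != k."""
    n = len(x_i)
    return [[x_i[j] * (1 - x_i[j]) if j == k else -x_i[j] * x_i[k]
             for k in range(n)] for j in range(n)]


def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(m * w for m, w in zip(row, v)) for row in M]


def replicator(U_i, x_i):
    """Entrywise form (13): (U_is - sum_q U_iq * x_iq) * x_is."""
    avg = sum(u * x for u, x in zip(U_i, x_i))
    return [(u - avg) * x for u, x in zip(U_i, x_i)]


# Hypothetical strategy and expected-payoff vector of one agent.
x_i = [0.5, 0.3, 0.2]
U_i = [1.0, 4.0, 2.5]

g_matrix = matvec(X_matrix(x_i), U_i)  # form (14)
g_direct = replicator(U_i, x_i)        # form (13)
row_sums = [sum(row) for row in X_matrix(x_i)]
```

Each row of $X_i(x_i)$ sums to zero whenever $x_i$ lies on the simplex, which is why adding a constant to all payoffs of an agent leaves the vector field unchanged.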

The following proposition and corollaries characterize the stationary points of the ODE (12) for $\lambda = 0$ and are well-known results for replicator dynamics (see, e.g., Section 3.3.1 in [26]).

Proposition 5.1 (Stationary points of unperturbed dynamics) For $\lambda = 0$, a strategy profile $x^*$ is a stationary point of the ODE (12) if and only if, for every agent $i \in \mathcal{I}$, there exists a constant $c_i > 0$, such that for any action $j \in \mathcal{A}_i$, $x^*_{ij} > 0$ implies $U_{ij}(x^*) = c_i$.

Two straightforward implications of Proposition 5.1 are:

Corollary 5.1 (Pure Strategies) For $\lambda = 0$, any pure strategy profile is a stationary point of the ODE (12).

Corollary 5.2 (Nash Equilibria) For $\lambda = 0$, any Nash equilibrium is a stationary point of the ODE (12).

Note that for some games not all stationary points of the ODE (12) are Nash equilibria. For example, in the Typewriter Game of Table 1, the pure strategy profiles which correspond to $(A, B)$ or $(B, A)$ are not Nash equilibria, although they are stationary points of (12) for $\lambda = 0$.

            A        B
    A     4, 4     2, 2
    B     2, 2     3, 3

Table 1: The Typewriter Game.

On the other hand, any stationary point in the interior of the probability simplex will necessarily be a Nash equilibrium, as the following corollary states:

Corollary 5.3 (Mixed Nash equilibria) For $\lambda = 0$, any stationary point $x^*$ of the ODE (12) such that $x^* \in \Delta^{\circ}$ is a (mixed) Nash equilibrium of the game.

Note that the above corollaries do not exclude the possibility that there exist stationary points in $\partial\Delta$ without those necessarily being pure strategy profiles. For the remainder of the paper, we will only consider games which satisfy the following property:

Assumption 5.1 For the unperturbed dynamics, there are no stationary points in $\partial\Delta$ other than the ones in $\Delta^*$, i.e., $\mathcal{S}_{\partial\Delta} \setminus \mathcal{S}_{\Delta^*} = \emptyset$. Moreover, if $\mathcal{S}_{\Delta^{\circ}} \neq \emptyset$, there exists $\delta > 0$ such that $\mathcal{B}_{\delta}(\mathcal{S}_{\Delta^{\circ}}) \subset \Delta^{\circ}$.
In other words, we only consider games for which the stationary points of (12) for $\lambda = 0$ in the boundary of $\Delta$ are vertices of $\Delta$, and the stationary points in $\Delta^{\circ}$ are isolated from the boundary. Assumption 5.1 is not restrictive and is satisfied for most but trivial cases.

Note  also  that  Assumption  5.1  does  not  exclude  the  possibility  that  the  vector  field  g(x)  exhibits  invariant 
sets  other  than  stationary  points. 
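The statements of this subsection can be illustrated on the Typewriter Game (payoffs 4, 2, 2, 3, identical for both agents). The sketch below is an illustration under stated assumptions, not part of the original analysis: it evaluates the unperturbed replicator vector field at the four pure profiles and at the interior point where each agent plays $A$ with probability $1/3$, and it checks the Nash property by testing whether any unilateral pure deviation is profitable. The helper names are hypothetical.

```python
D = [[4.0, 2.0], [2.0, 3.0]]  # common payoffs; symmetric, so D equals its transpose


def payoffs(x_other):
    """Expected payoff of each own action against the other agent's strategy."""
    return [sum(d * x for d, x in zip(row, x_other)) for row in D]


def replicator(U, x):
    """Unperturbed mean-field of one agent."""
    avg = sum(u * xs for u, xs in zip(U, x))
    return [(u - avg) * xs for u, xs in zip(U, x)]


def is_stationary(x1, x2, tol=1e-12):
    g = replicator(payoffs(x2), x1) + replicator(payoffs(x1), x2)
    return all(abs(v) < tol for v in g)


def is_nash(x1, x2, tol=1e-12):
    """No agent gains by deviating to a pure action."""
    for own, other in ((x1, x2), (x2, x1)):
        U = payoffs(other)
        current = sum(u * xs for u, xs in zip(U, own))
        if max(U) > current + tol:
            return False
    return True


A, B = [1.0, 0.0], [0.0, 1.0]
pure = {"AA": (A, A), "AB": (A, B), "BA": (B, A), "BB": (B, B)}
stationary = {name: is_stationary(*p) for name, p in pure.items()}
nash = {name: is_nash(*p) for name, p in pure.items()}

mixed = [1.0 / 3.0, 2.0 / 3.0]  # makes the other agent indifferent between A and B
```

All four vertices are stationary, in line with Corollary 5.1, but only $(A, A)$ and $(B, B)$ pass the Nash test; the interior point is both stationary and a mixed Nash equilibrium, in line with Corollary 5.3.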

5.2  Stationary Points of Perturbed Dynamics ($\lambda > 0$)

A straightforward implication of the properties of the perturbation function is the following:

Lemma 5.1 (Sensitivity of $\mathcal{S}_{\Delta^{\circ}}$) There exists $\beta_0 \in (0, 1)$ such that $\mathcal{S}_{\Delta^{\circ}} \subseteq \mathcal{S}^{\lambda}_{\Delta^{\circ}}$ for any $\beta_0 < \beta < 1$ and any $\lambda > 0$.

Proof. Due to Assumption 5.1, there exist $\beta_0 \in (0, 1)$ sufficiently close to one and $\lambda > 0$, such that, for any $\beta_0 < \beta < 1$, we have $\zeta_i(x_i, \lambda) = 0$ for all $i \in \mathcal{I}$ and $x \in \mathcal{B}_{\delta}(\mathcal{S}_{\Delta^{\circ}})$. Thus, the conclusion follows. □

Vertices of $\Delta$ cease to be equilibria for $\lambda > 0$. The following lemma provides the sensitivity of $\mathcal{S}_{\Delta^*}$ to small values of $\lambda$.

Lemma 5.2 (Sensitivity of $\mathcal{S}_{\Delta^*}$) For any stationary point $x^* \in \mathcal{S}_{\Delta^*}$ which corresponds to a strict Nash equilibrium and for sufficiently small $\lambda > 0$, there exists a unique continuously differentiable function $\nu^* : \mathbb{R}_+ \to \mathbb{R}^{|\mathcal{A}|}$, such that $\lim_{\lambda \downarrow 0} \nu^*(\lambda) = \nu^*(0) = 0$, and

$$\tilde{x} = x^* + \nu^*(\lambda) \in \Delta^{\circ} \tag{15}$$

is a stationary point of the ODE (12). If instead $x^* \in \mathcal{S}_{\Delta^*}$ is not a Nash equilibrium, then for any sufficiently small $\delta > 0$ and $\lambda > 0$, the $\delta$-neighborhood of $x^*$ in $\Delta$, $\mathcal{B}_{\delta}(x^*)$, does not contain any stationary point of the ODE (12).

Proof. The proof follows reasoning similar to that of the proof of Proposition 3.5 in [14]. The main steps of the proof are presented in Appendix A. □

Note that the statements of Lemma 5.2 do not depend on the selection of $\beta$. Instead, they require $\lambda$ to be sufficiently small. Also, note that Lemma 5.2 does not discuss the sensitivity of Nash equilibria which are not strict. However, it is straightforward to show that vertices cannot be stationary points for $\lambda > 0$.

Let also $\mathcal{S}^{\lambda}_{NE}$ denote the set of stationary points in $\Delta^{\circ}$ which are perturbations of the stationary points in $\mathcal{S}_{\Delta^*} \cap \mathcal{S}_{NE}$ (strict or non-strict) for some $\lambda > 0$.

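Proposition 5.2 (Stationary points of perturbed dynamics) For any $\beta \in (0, 1)$, let $\delta^* = \delta^*(\beta)$ be the smallest $\delta > 0$ such that, for all $x \in \Delta \setminus \mathcal{B}_{\delta}(\Delta^*)$, $\zeta_i(x_i, \lambda) = 0$ for some $i \in \mathcal{I}$. When $\beta$ is sufficiently close to one and $\lambda > 0$ is sufficiently small, then:

(a) $\mathcal{S}^{\lambda}_{NE} \subset \mathcal{B}_{\delta^*}(\Delta^*)$;

(b) $\mathcal{S}^{\lambda} = \mathcal{S}_{\Delta^{\circ}} \cup \mathcal{S}^{\lambda}_{NE}$.

In other words, the stationary points of the perturbed dynamics are either the interior stationary points of the unperturbed dynamics or perturbations of pure Nash equilibria. Proposition 5.2 is an immediate implication of Lemmas 5.1 and 5.2 and Assumption 5.1.

Proof. Pick $\beta > \beta_0$, where $\beta_0$ is defined in Lemma 5.1. Then $\mathcal{S}_{\Delta^{\circ}} \subseteq \mathcal{S}^{\lambda}_{\Delta^{\circ}} = \mathcal{S}^{\lambda}$. The rest of the stationary points are perturbations of the vertices characterized by Lemma 5.2. Due to the definition of $\delta^* = \delta^*(\beta)$, we have $\mathcal{S}^{\lambda}_{NE} \subset \mathcal{B}_{\delta^*}(\Delta^*)$, since outside $\mathcal{B}_{\delta^*}(\Delta^*)$ the dynamics coincide with the unperturbed dynamics for at least one agent. When we further take $\beta$ to be sufficiently close to one (which implies that $\delta^* = \delta^*(\beta)$ approaches zero) and $\lambda$ sufficiently small, then, according to Lemma 5.2, the points of $\mathcal{S}^{\lambda}_{NE}$ are the only stationary points in $\mathcal{B}_{\delta^*}(\Delta^*)$, and therefore $\mathcal{S}^{\lambda} = \mathcal{S}_{\Delta^{\circ}} \cup \mathcal{S}^{\lambda}_{NE}$. □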

Note that due to the introduction of the state-dependent perturbation function in the decision rule of the players, vertices of $\Delta$ cease to be stationary points of the ODE (12) when $\lambda > 0$. Due to this property, the introduction of the state-dependent perturbation function will address issues related to showing nonconvergence to boundary points which do not correspond to Nash equilibria [27, 4]. This will become more apparent in the forthcoming Section 6 when we discuss the probability of convergence to boundary points.

5.3  Demonstration 

To demonstrate the sensitivity of the stationary points to the perturbation function, we plot the vector field of the ODE (12) in the vicinity of $\Delta^*$, i.e., the vertices of the domain $\Delta$. For demonstration purposes, we assume that there are two agents whose utility function is defined by Table 1, i.e., there are two pure Nash equilibria corresponding to action profiles $(A, A)$ and $(B, B)$.

Fig. 2 plots the vector field of the ODE (12) in the vicinity of a non-Nash pure strategy profile, specifically in the vicinity of $(B, A)$, which corresponds to strategies $1 - x_{11} = x_{21} = 1$. We observe that this is a stationary point of the ODE (12) for $\lambda = 0$, while it is no longer a stationary point when $\lambda = 0.01$. This conclusion agrees with the second statement of Lemma 5.2, which states that for a sufficiently small neighborhood $\mathcal{B}_{\delta}(x^*)$ of a non-Nash action profile $x^*$, and for sufficiently small $\lambda > 0$, $\mathcal{B}_{\delta}(x^*)$ does not contain any stationary point of the ODE (12).

On the other hand, Fig. 3 plots the vector field of the ODE (12) in the vicinity of $(B, B)$, which corresponds to strategies $x_{11} = x_{21} = 0$. As expected, when $\lambda = 0$, this strategy allocation corresponds to a stationary point of the ODE, as shown in Fig. 3(a). When, instead, $\lambda = 0.01$ in Fig. 3(b), observe the slight displacement of the original stationary point towards the interior of the probability simplex, as predicted by the first statement of Lemma 5.2.

Figure 2: Sensitivity of a non-Nash stationary point to $\lambda$: (a) $\lambda = 0$, (b) $\lambda = 0.01$ and $\beta = 0.9$.

Figure 3: Sensitivity of a strict Nash stationary point to $\lambda$: (a) $\lambda = 0$, (b) $\lambda = 0.01$ and $\beta = 0.9$.


6  Convergence  to  Boundary  Points 

Recall that, for the unperturbed dynamics, not all stationary points in $\Delta^*$ are necessarily Nash equilibria. Convergence to non-desirable stationary points, such as the ones which are not Nash equilibria, cannot be excluded when agents employ the unperturbed reinforcement scheme $\mathcal{L}_{R-I}$.

Proposition 6.1 (Convergence to boundary points) Under the reinforcement scheme $\mathcal{L}_{R-I}$, the probability that the same action profile will be played for all future times is uniformly bounded away from zero over all initial conditions if $R_i(\alpha) > 1$ for each $\alpha \in \mathcal{A}$, $i \in \mathcal{I}$.

Proof.  See  Appendix  B.  □ 


Proposition  6.1  reveals  the  main  issue  of  applying  reinforcement  learning  schemes,  which  is  convergence 
with  positive  probability  to  boundary  points  which  are  not  Nash  equilibrium  profiles. 
Fig. 4 shows a typical response of $\mathcal{L}_{R-I}$ in the Typewriter Game of Table 1. We observe that it is possible for the process to converge to a pure strategy profile which is not a Nash equilibrium when $R_i(\alpha) > 1$ for all $\alpha \in \mathcal{A}$ and $i \in \mathcal{I}$.

Figure 4: Typical response of $\mathcal{L}_{R-I}$ on the Typewriter Game of Table 1 when $\nu = 0.78$ (horizontal axis: time step $k$).

This issue has been addressed before in references [27, 4]. In particular, in [27] the reinforcement model of [1] was considered. The only difference between the learning model of [1] and $\mathcal{L}_{R-I}$ is the step-size sequence, which in the case of [1] is defined as $\epsilon(k) = 1/(ck^{\nu} + R_i(\alpha))$ for some positive parameter $c$ and for $0 < \nu \leq 1$. Reference [27] showed that convergence to pure strategy profiles which are not Nash equilibria can be excluded as long as $c > R_i(\alpha)$ for all $i \in \mathcal{I}$ and $\nu = 1$. This statement agrees with Proposition 6.1, since in $\mathcal{L}_{R-I}$ we have $c = 1$.
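The role of the step-size sequence can be explored with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' experiment: it runs the unperturbed recursion $x_i(k+1) = x_i(k) + \epsilon(k) R_i(\alpha(k)) [e_{\alpha_i(k)} - x_i(k)]$ on the Typewriter Game with the step size $\epsilon(k) = 1/(ck^{\nu} + R_i(\alpha))$ quoted above. The values `c = 1.0` and `nu = 0.78`, the uniform initial strategies, and the seed are illustrative choices.

```python
import random

D = [[4.0, 2.0], [2.0, 3.0]]  # Typewriter Game payoffs, common to both agents


def simulate(steps, c=1.0, nu=0.78, seed=0):
    """Run the unperturbed recursion; c, nu, seed and the start are assumed values."""
    rng = random.Random(seed)
    x = [[0.5, 0.5], [0.5, 0.5]]  # initial mixed strategies of the two agents
    for k in range(1, steps + 1):
        # Each agent samples an action from its current strategy.
        a = [0 if rng.random() < x[i][0] else 1 for i in range(2)]
        reward = D[a[0]][a[1]]
        eps = 1.0 / (c * k ** nu + reward)
        for i in range(2):
            e = [1.0 if j == a[i] else 0.0 for j in range(2)]
            # eps * reward lies in (0, 1), so this is a convex combination.
            x[i] = [xs + eps * reward * (es - xs) for xs, es in zip(x[i], e)]
    return x


final = simulate(2000)
```

With this step size the product $\epsilon(k) R_i(\alpha)$ always lies in $(0, 1)$, so every update is a convex combination of the current strategy and a vertex, and the iterates remain on the simplex. Which vertex a given run approaches depends on the seed, so no particular limit is claimed here.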

An alternative approach for guaranteeing nonconvergence to stationary points which are not Nash equilibria is the urn process of [3]. This model can be rewritten in the recursive form of $\mathcal{L}_{R-I}$, for which the step-size sequence will be $\epsilon_i(k) = 1/(V_i(k) + R_i(\alpha))$, where $V_i(k)$ is the accumulated benefits of agent $i$ up to time $k$. This model has been analyzed in [4], where it was shown that the recursion converges with probability zero to any stationary point of the replicator dynamics which is not a Nash equilibrium. However, as [4] points out and as we also showed in Proposition 6.1, similar statements cannot be derived for more general reinforcement learning schemes.

The perturbed reinforcement scheme $\mathcal{L}^{\lambda}_{R-I}$ introduced in Section 3 will provide an alternative approach for dealing with nonconvergence to pure-strategy profiles which are not Nash equilibria and will allow for establishing a connection to standard replicator dynamics (13).
7  Convergence of Perturbed Dynamics ($\mathcal{L}^{\lambda}_{R-I}$)

The convergence analysis of the perturbed dynamics will be subject to the following assumption:

Assumption 7.1 For the unperturbed dynamics, $\mathcal{L}_{R-I}$, there exists a twice continuously differentiable and nonnegative function $V : \Delta \to \mathbb{R}_+$ such that

(a) $\nabla_x V(x)^T g(x) \leq 0$ for all $x \in \Delta$;

(b) $\nabla_x V(x)^T g(x) = 0$ if and only if $g(x) = 0$.

For some $\delta > 0$, consider the $\delta$-neighborhood of the set of stationary points $\mathcal{S}^{\lambda}$, $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$. Define also the closed set: $\mathcal{D}_{\delta}(\mathcal{S}^{\lambda}) = \Delta \setminus \mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$.

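Lemma 7.1 Under Assumption 7.1, for $\beta \in (0, 1)$ sufficiently close to one and $\lambda > 0$ sufficiently small, there exists $\delta = \delta(\beta, \lambda) > 0$ such that

$$\sup_{x \in \mathcal{D}_{\delta}(\mathcal{S}^{\lambda})} \nabla_x V(x)^T g^{\lambda}(x) < 0.$$

Proof. Pick $\delta^* = \delta^*(\beta)$ according to Proposition 5.2, such that, for all $x \in \Delta \setminus \mathcal{B}_{\delta^*}(\Delta^*)$, $\zeta_i(x_i, \lambda) = 0$ for at least one agent $i$. Then, according to Proposition 5.2, when we take $\beta$ sufficiently close to one (which implies that $\delta^*$ approaches zero) and $\lambda$ sufficiently small, then (a) $\mathcal{S}^{\lambda}_{NE} \subset \mathcal{B}_{\delta^*}(\Delta^*)$, and (b) $\mathcal{S}^{\lambda} = \mathcal{S}_{\Delta^{\circ}} \cup \mathcal{S}^{\lambda}_{NE}$. Due to Assumption 7.1, there exists $\delta = \delta(\beta, \lambda) > \delta^*$ such that $\mathcal{B}_{\delta^*}(\Delta^*) \subset \mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$ and

$$\sup_{x \in \mathcal{D}_{\delta}(\mathcal{S}^{\lambda})} \nabla_x V(x)^T g^{\lambda}(x) < 0.$$

Thus, the conclusion follows. □

Lemma 7.2 (LAS - $\mathcal{L}^{\lambda}_{R-I}$) For any $\lambda > 0$ sufficiently small, any stationary point $\tilde{x} \in \mathcal{S}^{\lambda}_{NE}$ which is a perturbation of a strict Nash equilibrium according to (15) is a locally asymptotically stable point of the ODE (12).

Proof. The proof follows reasoning similar to that of the proof of Proposition 3.6 in [14]. □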


Theorem 7.1 (Convergence to Nash equilibria) Under Assumption 7.1, if agents employ the perturbed reinforcement scheme $\mathcal{L}^{\lambda}_{R-I}$ for some $\beta \in (0, 1)$ sufficiently close to one and $\lambda > 0$ sufficiently small, then there exists $\delta = \delta(\beta, \lambda)$ such that

$$\mathbb{P}\Big[\liminf_{k\to\infty} \mathrm{dist}(x(k), \mathcal{B}_{\delta}(\mathcal{S}^{\lambda})) = 0\Big] = 1.$$

Also, for almost all $\omega$, the process $\{\psi_k(\omega) = x(k)\}$ converges to some invariant set in $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$.
Proof. Consider the nonnegative function $V(x)$ of Assumption 7.1. We can approximate the expected incremental gain of $V(x)$ by applying a Taylor series expansion as follows:

$$\Delta V(k, x) = \nabla_x V(x)^T \mathbb{E}[x(k+1) - x(k) \mid x(k) = x] + O(\epsilon(k)^2),$$

where $O(\epsilon(k)^2)$ denotes terms of order $\epsilon(k)^2$. Note that such an expansion is possible due to the fact that the second-order derivatives of $V(\cdot)$ are continuous in $\Delta$. Equivalently,

$$\Delta V(k, x) = \epsilon(k)\,\nabla_x V(x)^T g^{\lambda}(x) + O(\epsilon(k)^2). \tag{16}$$

Due to Lemma 7.1, there exists $\delta = \delta(\beta, \lambda) > 0$ such that

$$-\rho \triangleq \sup_{x \in \mathcal{D}_{\delta}(\mathcal{S}^{\lambda})} \nabla_x V(x)^T g^{\lambda}(x) < 0.$$

Thus,

$$\Delta V(k, x) \leq -\epsilon(k)\rho + O(\epsilon(k)^2),$$

uniformly in $x \in \mathcal{D}_{\delta}(\mathcal{S}^{\lambda})$. Since $\epsilon(k) \to 0$, the right-hand side of the above inequality is strictly negative for all sufficiently large $k$ and can be formulated in the form of condition (9). Therefore, the conditions of Corollary 4.1 are satisfied and

$$\mathbb{P}\Big[\liminf_{k\to\infty} \mathrm{dist}(x(k), \mathcal{B}_{\delta}(\mathcal{S}^{\lambda})) = 0\Big] = 1.$$

From Proposition 4.2, we also have that the process $\{\psi_k(\omega) = x(k)\}$ will converge to some invariant set of the ODE in $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$ almost surely. □


Theorem 7.1 is an immediate implication of Assumptions 5.1 and 7.1 and Lemma 7.1 and shows that the perturbed reinforcement scheme $\mathcal{L}^{\lambda}_{R-I}$ converges almost surely to some invariant set within an arbitrarily small neighborhood of the stationary points $\mathcal{S}^{\lambda}$ of the perturbed dynamics.

Recall also that, according to Proposition 5.2, the set $\mathcal{S}^{\lambda}$ includes a) the stationary points in $\Delta^{\circ}$ which are perturbations of $\mathcal{S}_{\Delta^*} \cap \mathcal{S}_{NE}$ (i.e., stationary points which correspond to strict and non-strict pure Nash equilibria), and b) interior stationary points of the unperturbed dynamics, $\mathcal{S}_{\Delta^{\circ}}$ (e.g., mixed Nash equilibria). Although Theorem 7.1 does not explicitly characterize the invariant sets within $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda})$ to which convergence is attained, $\delta = \delta(\beta, \lambda)$ can become arbitrarily small by appropriately selecting $\beta$ and $\lambda$.

If we further exclude convergence to $\mathcal{S}_{\Delta^{\circ}}$ (which is possible for some classes of games, as we shall see in the forthcoming Section 8), then Theorem 7.1 will imply convergence to an arbitrarily small neighborhood of $\mathcal{S}^{\lambda}_{NE}$. In this case, given that the stationary points in $\mathcal{S}^{\lambda}_{NE}$ can become arbitrarily close to the corresponding vertices of the simplex (due to Proposition 5.2(a)), Theorem 7.1 implies convergence to an arbitrarily small neighborhood of the corresponding vertices of the simplex (i.e., the ones corresponding to $\lambda$-perturbations of Nash equilibria of the game). We will discuss this observation in greater detail in the forthcoming Section 8.
8  Specialization to Potential Games

8.1  Potential games

In this section, we will specialize the convergence analysis of Section 7 to a class of games which belongs to the general family of ordinal potential games (cf., [15]). In particular, we will consider games which satisfy the following property:

Property 8.1 There exists a $C^2$ function $f : \Delta \to \mathbb{R}$ such that

$$\nabla_{\sigma_i} f(\sigma) = U_i(\sigma)$$

for all $\sigma \in \Delta$ and $i \in \mathcal{I}$.

This property has been used to define potential games in population games [28], where players from an infinite-size population are paired to play the game, and $\sigma$ corresponds to the average strategy in the population. The model here is equivalent, since instead of an infinite-size population of players and finite strategies, we consider a finite number of players with a continuum of strategies. A straightforward calculation can show that the function $f$ serves as a potential function under the definition of [15], since for every $i \in \mathcal{I}$ and $\sigma_i, \sigma_i' \in \Delta(|\mathcal{A}_i|)$, we have

$$f(\sigma_i', \sigma_{-i}) - f(\sigma_i, \sigma_{-i}) = \nabla_{\sigma_i} f(\sigma)^T (\sigma_i' - \sigma_i) = U_i(\sigma)^T (\sigma_i' - \sigma_i) = u_i(\sigma_i', \sigma_{-i}) - u_i(\sigma_i, \sigma_{-i}),$$

where the first equality results from the Taylor series expansion of $f$ about $\sigma = (\sigma_i, \sigma_{-i}) \in \Delta$ and the fact that $\nabla^2_{\sigma_i} f(\sigma) = 0$.

Example 1: (Common-payoff games) One class of games which satisfies Property 8.1 is common-payoff or identical-interest games, where the payoff function is the same for all players. In other words, there exists a function $d : \mathcal{A} \to \mathbb{R}_+$ such that the expected payoff of player $i \in \mathcal{I}$ at strategy profile $\sigma$ is:

$$u_i(\sigma) = \sum_{\alpha \in \mathcal{A}} d(\alpha) \prod_{k \in \mathcal{I}} \sigma_{k\alpha_k}.$$

Define $f(\sigma) = u_i(\sigma)$ for some $i \in \mathcal{I}$. Then, it is straightforward to check that

$$\frac{\partial f(\sigma)}{\partial \sigma_{ij}} = \sum_{\{\alpha : \alpha_i = j\}} d(\alpha) \prod_{k \in -i} \sigma_{k\alpha_k} = U_{ij}(\sigma),$$

i.e., $f$ satisfies Property 8.1. An example of a common-payoff game is the Typewriter Game of Table 1.
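For a two-player common-payoff game, both the potential identity and the gradient condition of Property 8.1 can be verified directly. The sketch below is illustrative only: it uses the Typewriter Game payoffs (4, 2, 2, 3) as $d(\alpha)$, and the strategies `s1`, `s1_new`, `s2` are hypothetical. It compares the change in $f$ under a unilateral deviation of player 1 with $U_1(\sigma)^T(\sigma_1' - \sigma_1)$, and approximates $\partial f / \partial \sigma_{1j}$ by finite differences.

```python
D = [[4.0, 2.0], [2.0, 3.0]]  # d(alpha) for the two-player Typewriter Game


def f(s1, s2):
    """f(sigma) = sum over action profiles of d(alpha) * sigma_{1,a1} * sigma_{2,a2}."""
    return sum(D[a][b] * s1[a] * s2[b] for a in range(2) for b in range(2))


def U1(s2):
    """U_1j(sigma): expected payoff of player 1's action j against s2."""
    return [sum(D[j][b] * s2[b] for b in range(2)) for j in range(2)]


# Hypothetical strategies: player 1 deviates from s1 to s1_new.
s1, s1_new, s2 = [0.6, 0.4], [0.1, 0.9], [0.25, 0.75]

lhs = f(s1_new, s2) - f(s1, s2)
rhs = sum(u * (new - old) for u, new, old in zip(U1(s2), s1_new, s1))

# Finite-difference approximation of df/dsigma_{1j}.
h = 1e-6
grad = [
    (f([s1[0] + h * (j == 0), s1[1] + h * (j == 1)], s2) - f(s1, s2)) / h
    for j in range(2)
]
```

Because $f$ is linear in each player's own strategy, the first-order expansion is exact, which is the role played by the vanishing second derivative in the calculation above.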

Example 2: (Congestion games) A typical congestion game consists of a set $\mathcal{I}$ of $n$ players and a set $\mathcal{P}$ of $m$ paths. For each player $i$, let the set of pure strategies $\mathcal{A}_i$ be the set of $m$ paths. The cost to each player $i$ of selecting the path $p$ depends on the number of players that are using the same path. The expected number of players using path $p$ is

$$\chi_p(\sigma) \triangleq \sum_{i \in \mathcal{I}} \sigma_{ip}.$$

Define $c_p = c_p(\chi_p)$ to be the cost of using path $p$ when $\chi_p$ players are using path $p$, and let $c_p(\chi_p)$ be linear in $\chi_p$. The expected utility of player $i$ is defined as:

$$u_i(\sigma) = -\sum_{p \in \mathcal{P}} \sigma_{ip}\, c_p(\chi_p(\sigma)).$$

Note that the function

$$f(\sigma) \triangleq -\sum_{p \in \mathcal{P}} \int_0^{\chi_p(\sigma)} c_p(z)\,dz$$

satisfies Property 8.1.


8.2  Convergence  to  Nash  equilibria 

The  following  proposition  establishes  convergence  to  Nash  equilibria  for  this  class  of  potential  games. 

Proposition 8.1 (Convergence to Nash equilibria) For the class of games satisfying Property 8.1, the perturbed reinforcement scheme $\mathcal{L}^{\lambda}_{R-I}$ satisfies the conclusions of Theorem 7.1.

Proof. It suffices to show that the conditions of Assumption 7.1 are satisfied for the unperturbed dynamics. In particular, define the nonnegative function

$$V(x) = f_{\max} - f(x) \geq 0, \quad x \in \Delta, \tag{17}$$

where $f_{\max} = \sup_{x \in \Delta} f(x)$. Note that $\nabla_{x_i} V(x) = -U_i(x)$, and

$$U_i(x)^T g_i(x) = U_i(x)^T X_i(x_i) U_i(x) = \sum_{s=1}^{|\mathcal{A}_i|} \sum_{j=1, j>s}^{|\mathcal{A}_i|} x_{is} x_{ij} \big(U_{is}(x) - U_{ij}(x)\big)^2 \geq 0.$$

Thus,

$$\nabla_x V(x)^T g(x) = -U(x)^T g(x) = -\sum_{i \in \mathcal{I}} U_i(x)^T X_i(x_i) U_i(x) \leq 0$$

for all $x \in \Delta$.

We also observe that $\nabla_x V(x)^T g(x) = 0$ if and only if $U_{is}(x) = U_{ij}(x)$ for any $i \in \mathcal{I}$ and any $s, j \in \mathcal{A}_i$, $s \neq j$, such that $x_{is}, x_{ij} > 0$. By Proposition 5.1, these points correspond to the stationary points of $g(x)$. Therefore, the conditions of Assumption 7.1 are also satisfied. Thus, the conclusions of Theorem 7.1 hold for the class of games satisfying Property 8.1. □
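The key step of the proof is that the quadratic form $U_i^T X_i(x_i) U_i$ equals a sum of nonnegative pairwise terms. A quick numerical check is sketched below, with hypothetical values for `x` and `U`; the function names are illustrative.

```python
def quad_form(U, x):
    """U^T X(x) U with X_jj = x_j * (1 - x_j) and X_jk = -x_j * x_k for j != k."""
    n = len(x)
    total = 0.0
    for j in range(n):
        for k in range(n):
            X_jk = x[j] * (1 - x[j]) if j == k else -x[j] * x[k]
            total += U[j] * X_jk * U[k]
    return total


def pairwise_form(U, x):
    """Sum over pairs s < j of x_s * x_j * (U_s - U_j)^2."""
    n = len(x)
    return sum(x[s] * x[j] * (U[s] - U[j]) ** 2
               for s in range(n) for j in range(s + 1, n))


# Hypothetical strategy (on the simplex) and payoff vector.
x = [0.2, 0.5, 0.3]
U = [1.0, 3.0, 2.0]

q = quad_form(U, x)
p = pairwise_form(U, x)
q_equal = quad_form([2.0, 2.0, 2.0], x)  # equal payoffs on the support
```

Both expressions equal the variance of the payoffs under the distribution `x`, which is zero exactly when all actions in the support earn the same payoff. This matches the stationarity condition of Proposition 5.1.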
8.3  Convergence to pure Nash equilibria

In several games, convergence to mixed Nash equilibria of the unperturbed dynamics $\mathcal{S}_{\Delta^{\circ}}$ can be excluded. In this case, convergence of the perturbed dynamics to stationary points in $\mathcal{S}^{\lambda}_{NE}$, which are perturbations of pure Nash equilibria, can be established.

Let $x_{-i}$ denote the distribution over action profiles of the group of agents $-i$. Let $D_i$ be the matrix of payoffs of agent $i$ and $D_{-i}$ be the matrix of payoffs of $-i$. The vectors of expected payoffs of agent $i$ and $-i$ can be expressed as $U_i(x) = D_i x_{-i}$ and $U_{-i}(x) = D_{-i} x_i$, respectively.

To analyze the behavior around stationary points in $\Delta^{\circ}$, we consider the nonnegative function $V(x) = f_{\max} - f(x) \geq 0$, $x \in \Delta$, where $f_{\max} = \sup_{x \in \Delta} f(x)$. It is straightforward to verify that the Hessian matrix of $f(x)$ is:

$$\nabla^2_x f(x) = \begin{pmatrix} O & D_i \\ D_{-i} & O \end{pmatrix}.$$

Higher-order derivatives of $f(x)$ will be zero; therefore, from the extension of Taylor's Theorem (cf., Theorem 5.15 in [29]) to multivariable functions, we have:

$$\Delta V(k, x) = -\nabla_x f(x)^T \mathbb{E}[\delta x(k) \mid x(k) = x] - \mathbb{E}[\delta x_{-i}(k)^T D_{-i}\, \delta x_i(k) \mid x(k) = x] - \mathbb{E}[\delta x_i(k)^T D_i\, \delta x_{-i}(k) \mid x(k) = x], \tag{18}$$

where $\delta x(k) = x(k+1) - x(k)$.

A direct consequence of the above formulation and Proposition 4.1 is the following:

Proposition 8.2 (Nonconvergence to $\mathcal{S}_{\Delta^{\circ}}$) If agents employ the unperturbed reinforcement scheme $\mathcal{L}_{R-I}$ and $x^* \in \mathcal{S}_{\Delta^{\circ}}$ satisfies

1. $\mathbb{E}[\delta x_{-i}(k)^T D_{-i}\, \delta x_i(k) \mid x(k) = x] > 0$,

2. $\mathbb{E}[\delta x_i(k)^T D_i\, \delta x_{-i}(k) \mid x(k) = x] > 0$

uniformly in $x \in \mathcal{B}_{\delta}(x^*)$, for some $\delta > 0$ sufficiently small, then

$$\mathbb{P}\Big[\lim_{k\to\infty} x(k) = x^*\Big] = 0.$$

Proof. We consider the nonnegative function $V(x)$ defined above. Note that the expected incremental gain of $V(x)$ in (18), under the unperturbed dynamics, can take the following form:

$$\Delta V(k, x) = -\epsilon(k)\,\rho(k, x),$$

where $\inf_{x \in \mathcal{B}_{\delta}(x^*)} \rho(k, x) > 0$ for some $\delta > 0$ sufficiently small and for all $k$. This is due to the fact that for any $x \in \mathcal{B}_{\delta}(x^*)$,

$$-\nabla_x f(x)^T \mathbb{E}[\delta x(k) \mid x(k) = x] \leq 0$$

(due to Proposition 8.1), and the second-order terms of the incremental gain are strictly negative by assumption. Then, from Proposition 4.1, we conclude that the unperturbed process will exit $\mathcal{B}_{\delta}(x^*)$ in finite time with probability one. Therefore, the conclusion follows. □

For several games, testing the conditions of Proposition 8.2 may be hard. However, for two-player two-action games, it is straightforward to show that:

$$\mathbb{E}\big[\delta x_i^T D_i\, \delta x_{-i} \mid x_i(k) = x_i, x_{-i}(k) = x_{-i}\big] = \epsilon(k)^2\, x_{i1} x_{i2} x_{(-i)1} x_{(-i)2}\, \big(d^i_{11} - d^i_{12} - d^i_{21} + d^i_{22}\big) \cdot \big((d^i_{11})^2 - (d^i_{12})^2 - (d^i_{21})^2 + (d^i_{22})^2\big), \tag{19}$$

where $d^i_{s\ell}$ denotes the $(s, \ell)$ entry of $D_i$, $i = 1, 2$. Consider, for example, the Typewriter Game of Table 1. Since the game is symmetric, and $d^i_{11} > d^i_{12}$, $d^i_{22} > d^i_{21}$, $i = 1, 2$, the second-order terms of the incremental gain will be positive. The above computation can be extended in a similar manner to the case of a larger number of actions.
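The closed-form expression for the two-player two-action case can be cross-checked by enumerating the four action profiles. The sketch below is a verification aid under stated assumptions, not the authors' derivation: it uses the Typewriter Game payoffs (4, 2, 2, 3) with a common reward for both agents, treats the step size `eps` as a given constant that does not depend on the reward, and uses hypothetical strategies `x1` and `x2`.

```python
D = [[4.0, 2.0], [2.0, 3.0]]  # common-payoff Typewriter Game


def exact_expectation(x1, x2, eps):
    """E[dx1^T D dx2 | x] with dx_i = eps * R * (e_{a_i} - x_i), by enumeration."""
    total = 0.0
    for a in range(2):
        for b in range(2):
            prob = x1[a] * x2[b]
            r = D[a][b]  # both agents receive the same reward
            dx1 = [eps * r * ((1.0 if j == a else 0.0) - x1[j]) for j in range(2)]
            dx2 = [eps * r * ((1.0 if j == b else 0.0) - x2[j]) for j in range(2)]
            bilinear = sum(dx1[s] * D[s][l] * dx2[l]
                           for s in range(2) for l in range(2))
            total += prob * bilinear
    return total


def closed_form(x1, x2, eps):
    """Product form: eps^2 * x11*x12*x21*x22 * (alternating sum) * (alternating sum of squares)."""
    kappa = D[0][0] - D[0][1] - D[1][0] + D[1][1]
    squares = D[0][0] ** 2 - D[0][1] ** 2 - D[1][0] ** 2 + D[1][1] ** 2
    return eps ** 2 * x1[0] * x1[1] * x2[0] * x2[1] * kappa * squares


x1, x2, eps = [0.3, 0.7], [0.6, 0.4], 0.01  # hypothetical interior point and step size
value_enum = exact_expectation(x1, x2, eps)
value_formula = closed_form(x1, x2, eps)
```

For these payoffs both factors are positive (3 and 17), so the second-order term is strictly positive at every interior point, which is the condition required by Proposition 8.2.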


Proposition 8.3 (Convergence to pure Nash equilibria) In the framework of Proposition 8.1, let the conditions of Proposition 8.2 also hold. If the game admits pure Nash equilibria which are all strict, then, for some $\beta \in (0, 1)$ sufficiently close to one and $\lambda > 0$ sufficiently small, the perturbed process $\{\psi_k(\omega) = x(k)\}$ converges to the set $\mathcal{S}^{\lambda}_{NE}$ for almost all $\omega$, i.e.,

$$\mathbb{P}\Big[\lim_{k\to\infty} x(k) \in \mathcal{S}^{\lambda}_{NE}\Big] = 1.$$

Proof. Since the game exhibits pure Nash equilibria which are all strict, the set $\mathcal{S}^{\lambda}_{NE}$ is non-empty for any $\lambda > 0$ sufficiently small.

Let $x^*$ denote an action profile which is a strict pure Nash equilibrium, i.e., for every $i \in \mathcal{I}$ there exists $j^* = j^*(i)$ such that $x^*_{ij^*} = 1$ and $U_{is}(x^*) - U_{ij^*}(x^*) < 0$ for any $s \neq j^*$. Let also $\tilde{x} \in \mathcal{S}^{\lambda}_{NE}$ be the perturbed stationary point according to (15). Pick also $\delta^* = \delta^*(\beta) > 0$ similarly to the proof of Lemma 7.1. Then, for any $x \in \mathcal{B}_{\delta^*}(\tilde{x})$, $x_{is}$ is of order $\delta^*$ and

$$g^{\lambda}_{is}(x) \approx \big[U_{is}(x^*) - U_{ij^*}(x^*)\big]\, x_{is} \tag{20}$$

plus higher-order terms of $\delta^*$ and $\lambda$, for all $s \neq j^*$. Since $U_{is}(x^*) - U_{ij^*}(x^*) < 0$ for all $s \neq j^*$, we conclude that the vector field points towards the interior of $\mathcal{B}_{\delta^*}(\tilde{x})$ when evaluated at the boundary of $\mathcal{B}_{\delta^*}(\tilde{x})$. Thus, $\mathcal{B}_{\delta^*}(\tilde{x})$ is an invariant set of the ODE (12). Therefore, due to Proposition 8.2 and Theorem 7.1, if we take $\beta \in (0, 1)$ sufficiently close to one and $\lambda > 0$ sufficiently small, then there exists $\delta = \delta(\beta, \lambda) > \delta^*$ such that the process $\{x(k)\}$ converges almost surely to some invariant set in $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda}_{NE})$.

Furthermore, due to Lemma 7.2, we know that the points in $\mathcal{S}^{\lambda}_{NE}$ are locally asymptotically stable, and therefore by (20), the set $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda}_{NE})$ belongs to its region of attraction. Since the perturbed process visits $\mathcal{B}_{\delta}(\mathcal{S}^{\lambda}_{NE})$ infinitely often, by Proposition 4.2, we conclude that the process converges to $\mathcal{S}^{\lambda}_{NE}$ with probability one. □
Proposition 8.3 specializes the conclusions of Theorem 7.1 to the case where i) Property 8.1 is satisfied, ii) the pure Nash equilibria of the game are all strict, and iii) the hypotheses of Proposition 8.2 also hold (i.e., convergence to mixed Nash equilibria can be excluded). Proposition 8.3 shows that asymptotic convergence to the set of stationary points $\mathcal{S}^{\lambda}_{NE}$ (i.e., $\lambda$-perturbations of pure Nash equilibria) can be achieved almost surely.

In case the game exhibits pure Nash equilibria which are not strict, the conclusions of Proposition 8.3 might not hold. However, Theorem 7.1 still applies. In particular, under the hypotheses of Proposition 8.2, Theorem 7.1 implies that the perturbed process will converge almost surely to an invariant set within an arbitrarily small neighborhood of $\mathcal{S}^{\lambda}_{NE}$ (i.e., $\lambda$-perturbations of pure Nash equilibria). Furthermore, due to Proposition 5.2, the stationary points in $\mathcal{S}^{\lambda}_{NE}$ can become arbitrarily close to the corresponding vertices of the simplex by appropriately selecting parameters $\beta$ and $\lambda$. We conclude that, even if the game exhibits pure Nash equilibria which are not strict, Theorem 7.1 implies that the perturbed process converges almost surely to an invariant set within an arbitrarily small neighborhood of the vertices corresponding to $\mathcal{S}^{\lambda}_{NE}$.

8.4  Extension  to  Two-Player  Rescaled  Partnership  Games 

The convergence results of Propositions 8.1-8.3 can be extended to two-player rescaled partnership games, introduced by [10] and defined as follows:

Definition 8.1 (Two-Player Rescaled Partnership Games) A two-player game with payoff matrices $D_i$, $i \in \{1, 2\}$, is a rescaled partnership game if there exist positive numbers $a_i$, $i \in \{1, 2\}$, and matrices

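$$C_i = \begin{pmatrix} c_{i1} & c_{i2} \\ c_{i1} & c_{i2} \end{pmatrix} \in \mathbb{R}^{2 \times 2}, \quad i \in \mathcal{I},$$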

such that the two-player game with payoff matrices

$$D_i' = a_i D_i + C_i, \quad i \in \{1,2\},$$

defines a partnership game, i.e., $D_1' = (D_2')^T$.

Note that two-player partnership games are also potential games with potential function $f : \mathcal{A} \to \mathbb{R}$
such that

$$f(\alpha) \triangleq \alpha_i^T D_i' \alpha_{-i} \tag{21}$$

for some $i \in \{1,2\}$.

As already pointed out by [4], two-player rescaled partnership games exhibit a nice property in connection
with standard replicator dynamics, summarized in the following claim.

Claim 8.1 For any two-player rescaled partnership game and for any $x \in \Delta$, the following holds:

$$X_i(x_i) D_i' x_{-i} = a_i X_i(x_i) D_i x_{-i}, \quad i \in \{1,2\}. \tag{22}$$
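Property (22) can be checked numerically on a small example. The sketch below is only illustrative and rests on two assumptions that are not spelled out in this section: that $X_i(x_i)$ is the replicator matrix $\mathrm{diag}(x_i) - x_i x_i^T$, and that each $C_i$ has identical rows, so that $C_i x_{-i}$ is a multiple of the all-ones vector. The payoff numbers and the helper names (`matvec`, `replicator_matrix`, `property_22_gap`) are hypothetical.

```python
# Numerical check of property (22) on a hypothetical 2x2 rescaled partnership game.
# Assumptions (not taken from the paper): X_i(x_i) = diag(x_i) - x_i x_i^T,
# and each C_i has identical rows.

def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def replicator_matrix(x):
    """Return diag(x) - x x^T."""
    n = len(x)
    return [[(x[r] if r == c else 0.0) - x[r] * x[c] for c in range(n)]
            for r in range(n)]

# Partnership game: D_1' = P and D_2' = P^T (illustrative numbers).
P = [[3.0, 1.0], [0.5, 2.0]]
D_prime = [P, [list(col) for col in zip(*P)]]
a = [2.0, 0.5]                        # positive rescaling factors a_i
C = [[[0.3, -0.7], [0.3, -0.7]],      # C_1 with identical rows
     [[1.1, 0.4], [1.1, 0.4]]]        # C_2 with identical rows

# Recover the original payoff matrices from D_i' = a_i D_i + C_i.
D = [[[(D_prime[i][r][c] - C[i][r][c]) / a[i] for c in range(2)]
      for r in range(2)] for i in range(2)]

def property_22_gap(x1, x2):
    """Largest entrywise difference between the two sides of (22)."""
    xs = [x1, x2]
    gap = 0.0
    for i in range(2):
        X = replicator_matrix(xs[i])
        lhs = matvec(X, matvec(D_prime[i], xs[1 - i]))
        rhs = [a[i] * t for t in matvec(X, matvec(D[i], xs[1 - i]))]
        gap = max(gap, max(abs(l - r) for l, r in zip(lhs, rhs)))
    return gap

grid = [k / 10 for k in range(11)]
max_gap = max(property_22_gap([p, 1 - p], [q, 1 - q]) for p in grid for q in grid)
print(max_gap)  # expected to be at the level of floating-point rounding
```

Under these assumptions the identity holds because $X_i(x_i)$ annihilates the all-ones vector whenever $x_i$ sums to one, so the contribution of $C_i$ drops out.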
Due to this property, convergence to Nash equilibria in rescaled partnership games can be established under
the perturbed reinforcement scheme.

Proposition 8.4 (Convergence in Rescaled Partnership Games) In the class of two-player rescaled partnership
games, the reinforcement scheme satisfies the conclusions of Theorem 7.1. Furthermore, if
the conditions of Proposition 8.2 hold and the game admits pure Nash equilibria which are all strict, then,
for some $\beta \in (0,1)$ sufficiently close to one and $\lambda > 0$ sufficiently small, the process $\{x(k)\}$
converges to the set $\mathcal{S}^{\lambda}_{NE}$ for almost all sample paths $\omega$, i.e.,

$$\mathbb{P}\Big[\lim_{k\to\infty} x(k) \in \mathcal{S}^{\lambda}_{NE}\Big] = 1.$$

Proof. It suffices to show that the conditions of Assumption 7.1 are satisfied. In particular, define the
nonnegative function

$$V(x) = f_{\max} - f(x) \geq 0, \quad x \in \Delta,$$

where $f$ is defined according to (21) and $f_{\max} = \sup_{x \in \Delta} f(x)$. Note that $\nabla_{x_i} V(x) = -D_i' x_{-i}$, and

$$\begin{aligned}
\nabla_x V(x)^T g(x) &= -\sum_{i \in \mathcal{I}} x_{-i}^T (D_i')^T g_i(x) \\
&= -\sum_{i \in \mathcal{I}} x_{-i}^T (D_i')^T X_i(x_i) D_i x_{-i} \\
&= -\sum_{i \in \mathcal{I}} a_i x_{-i}^T D_i^T X_i(x_i) D_i x_{-i} \\
&\leq 0
\end{aligned}$$

for all $x \in \Delta$, where the last equality is due to property (22) together with the symmetry of $X_i(x_i)$, and
the last inequality is due to the fact that $X_i(x_i)$ is a positive semidefinite (symmetric) matrix (as was shown
in the proof of Proposition 8.1 and was pointed out in Exercise 9.6.3 of [10]).
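The sign of $\nabla_x V(x)^T g(x)$ can also be inspected numerically. The following is a minimal sketch under stated assumptions, not the construction used in the proof: it assumes the mean-field term $g_i(x) = X_i(x_i) D_i x_{-i}$ with $X_i(x_i) = \mathrm{diag}(x_i) - x_i x_i^T$, and it uses a hypothetical 2x2 game whose matrices $C_i$ have identical rows. All numbers and helper names are illustrative.

```python
# Evaluate the Lyapunov derivative grad V(x)^T g(x) on a grid of mixed strategies.
# Assumptions (illustrative only): g_i(x) = X_i(x_i) D_i x_{-i} and
# X_i(x_i) = diag(x_i) - x_i x_i^T.

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def replicator_matrix(x):
    n = len(x)
    return [[(x[r] if r == c else 0.0) - x[r] * x[c] for c in range(n)]
            for r in range(n)]

# Hypothetical rescaled partnership game: D_i' = a_i D_i + C_i with D_1' = (D_2')^T.
P = [[3.0, 1.0], [0.5, 2.0]]
D_prime = [P, [list(col) for col in zip(*P)]]
a = [2.0, 0.5]
C = [[[0.3, -0.7], [0.3, -0.7]],
     [[1.1, 0.4], [1.1, 0.4]]]
D = [[[(D_prime[i][r][c] - C[i][r][c]) / a[i] for c in range(2)]
      for r in range(2)] for i in range(2)]

def lyapunov_derivative(x1, x2):
    """Sum over agents of grad_{x_i} V(x)^T g_i(x), with grad_{x_i} V = -D_i' x_{-i}."""
    xs = [x1, x2]
    total = 0.0
    for i in range(2):
        grad_V_i = [-t for t in matvec(D_prime[i], xs[1 - i])]
        g_i = matvec(replicator_matrix(xs[i]), matvec(D[i], xs[1 - i]))
        total += dot(grad_V_i, g_i)
    return total

grid = [k / 20 for k in range(21)]
worst = max(lyapunov_derivative([p, 1 - p], [q, 1 - q]) for p in grid for q in grid)
interior = lyapunov_derivative([0.5, 0.5], [0.5, 0.5])
print(worst, interior)
```

In this example the largest value over the grid is zero up to rounding (attained at vertices), while the value at the uniform strategy profile is strictly negative, in line with the inequality above.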

Also, due to (22), we have that the stationary points of the mean-field dynamics in a rescaled partnership
game with payoff matrices $D_i$, $i \in \mathcal{I}$, coincide with the stationary points of the mean-field dynamics of the
partnership game $D_i' = a_i D_i + C_i$, $i \in \mathcal{I}$. We also observe that

$$\nabla_x V(x)^T g(x) = 0 \iff a_i x_{-i}^T D_i^T X_i(x_i) D_i x_{-i} = 0, \quad \forall i \in \mathcal{I},$$

which, according to the proof of Proposition 8.1, is satisfied if and only if $x \in \Delta$ is a stationary point of
$g(x)$.

Thus,  the  conditions  of  Assumption  7.1  are  satisfied  for  the  two-player  rescaled  partnership  games  and 
therefore  the  conclusions  of  Theorem  7.1  hold.  Furthermore,  if  the  conditions  of  Proposition  8.2  apply  and 
the  game  admits  pure  Nash  equilibria  which  are  all  strict,  then  the  conclusions  of  Proposition  8.3  also  hold. 
□

An analogous result to Proposition 8.4 has been shown by [4] for rescaled partnership games under
the reinforcement learning scheme of [3]. Proposition 8.4 can be thought of as an extension to a larger
class of reinforcement learning schemes (beyond the urn process of [3]), due to the freedom in the selection
of the step-size sequence (4). In fact, analogous results under a perturbed decision rule of the form (6)
can be derived in a straightforward manner for other forms of reinforcement learning schemes, e.g., the
reinforcement scheme of [1].

9  Conclusions 

This  paper  presented  a  novel  reinforcement  learning  scheme  for  distributed  convergence  to  Nash  equilibria. 
The  main  difference  from  prior  schemes  lies  in  the  introduction  of  a  perturbation  function  in  the  decision 
rule of each agent which depends only on its own strategy. The introduction of such a perturbation function
sidestepped  issues  regarding  the  behavior  of  the  algorithm  close  to  the  vertices  of  the  simplex.  In  particular, 
we  derived  conditions  under  which  the  perturbed  reinforcement  learning  scheme  converges  to  an  arbitrarily 
small  neighborhood  of  the  set  of  Nash  equilibria  almost  surely.  This  constitutes  our  main  contribution, 
since  prior  convergence  analysis  on  reinforcement  learning  has  primarily  focused  on  urn  processes.  We 
further  specialized  the  results  to  a  class  of  games  which  belongs  to  the  family  of  potential  games.  We  finally 
extended  the  convergence  results  to  two-player  rescaled  partnership  games,  where  we  derived  conditions 
under  which  convergence  to  perturbations  of  strict  pure  Nash  equilibria  can  be  achieved. 

A  Proof  of  Proposition  5.2 

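For any agent $i \in \mathcal{I}$ and any action $s \in \mathcal{A}_i$, the corresponding entry of the vector field is

$$g_{is}(x) = U_{is}(x)\big[(1-\zeta_i)x_{is} + \zeta_i/|\mathcal{A}_i|\big] - \sum_{q \in \mathcal{A}_i} U_{iq}(x)\big[(1-\zeta_i)x_{iq} + \zeta_i/|\mathcal{A}_i|\big]x_{is}, \tag{23}$$

where $\zeta_i = \zeta_i(x_i, \lambda)$.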

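Consider any pure strategy profile $x^*$, and take $x = x^* + \nu$, for some $\nu = (\nu_1, \nu_2, \ldots, \nu_n) \in \times_{i \in \mathcal{I}} \mathbb{R}^{|\mathcal{A}_i|}$
such that $\nu_i \in \mathrm{null}\{\mathbf{1}^T\}$ and $x_i = x_i^* + \nu_i \in \Delta(|\mathcal{A}_i|)$ for all $i \in \mathcal{I}$. Substituting $x$ into (23) yields

$$g_{is}(\nu, \lambda) = U_{is}(x)\big[(1-\zeta_i)(x^*_{is} + \nu_{is}) + \zeta_i/|\mathcal{A}_i|\big] - \sum_{q \in \mathcal{A}_i} U_{iq}(x)\big[(1-\zeta_i)(x^*_{iq} + \nu_{iq}) + \zeta_i/|\mathcal{A}_i|\big](x^*_{is} + \nu_{is}),$$

where $\zeta_i = \zeta_i(x_i^* + \nu_i, \lambda)$. The perturbation function satisfies:

$$\left.\frac{\partial \zeta_i(x_i^* + \nu_i, \lambda)}{\partial \nu_{ij}}\right|_{(0,0)} = 0, \quad \text{for all } j \in \mathcal{A}_i.$$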

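Furthermore, $g_{is}(0,0) = 0$, since $x^*$ is a stationary point of the unperturbed dynamics. Thus, the partial
derivatives of $g_{is}$ evaluated at $(0,0)$ are:

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{is}}\right|_{(0,0)} = U_{is}(x^*)(1 - x^*_{is}) - \sum_{q \in \mathcal{A}_i} U_{iq}(x^*) x^*_{iq},$$

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{iq}}\right|_{(0,0)} = -U_{iq}(x^*) x^*_{is}, \quad \text{for all } q \in \mathcal{A}_i \setminus s.$$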


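Note also that for any $\ell \in \mathcal{I} \setminus i$ and $m \in \mathcal{A}_\ell$, we have

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{\ell m}}\right|_{(0,0)} = \left.\frac{\partial U_{is}(x)}{\partial \nu_{\ell m}}\right|_{(0,0)} x^*_{is} - \sum_{q \in \mathcal{A}_i} \left.\frac{\partial U_{iq}(x)}{\partial \nu_{\ell m}}\right|_{(0,0)} x^*_{iq} x^*_{is}.$$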


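Since $x^*$ corresponds to a pure strategy state, for each $i \in \mathcal{I}$ there exists $j^* = j^*(i)$ such that $x_i^* = e_{j^*}$,
i.e., $x^*_{ij^*} = 1$ and $x^*_{is} = 0$ for all $s \neq j^*$. For this pure strategy state and for any $s \in \mathcal{A}_i \setminus j^*$ we have

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{is}}\right|_{(0,0)} = U_{is}(x^*) - U_{ij^*}(x^*),$$

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{iq}}\right|_{(0,0)} = 0 \quad \forall q \in \mathcal{A}_i \setminus s,$$

and

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \nu_{\ell m}}\right|_{(0,0)} = 0 \quad \forall \ell \in \mathcal{I} \setminus i,\ m \in \mathcal{A}_\ell.$$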


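Given that $\nu_i \in \mathrm{null}\{\mathbf{1}^T\}$ and $\partial g_{is}(\nu,\lambda)/\partial \nu_{ij^*} = 0$ for all $s \neq j^*$, the behavior of $g(\cdot,\cdot)$ with respect
to $\nu$ about the point $(0,0)$ is described by the following Jacobian matrix:

$$\nabla_\nu \tilde{g}(\nu,\lambda)\big|_{(0,0)} = \begin{pmatrix} \mathrm{diag}\{U_{1s}(x^*) - U_{1j^*}(x^*)\}_{s \neq j^*} & & 0 \\ & \ddots & \\ 0 & & \mathrm{diag}\{U_{ns}(x^*) - U_{nj^*}(x^*)\}_{s \neq j^*} \end{pmatrix}.$$

The above Jacobian matrix has full rank if for each $i \in \mathcal{I}$

$$U_{is}(x^*) - U_{ij^*}(x^*) \neq 0 \quad \text{for all } s \neq j^*.$$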


In this case, by the implicit function theorem, there exists a neighborhood $D$ of $\lambda = 0$ and a unique
differentiable function $\nu^*(\lambda)$, defined on $D$, such that $\nu^*(0) = 0$ and $\tilde{g}(\nu^*(\lambda), \lambda) = 0$, for any $\lambda \in D$.

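To characterize exactly the stationary points for small values of $\lambda$, we need to also compute the gradient
of the mean-field with respect to the perturbation parameter $\lambda$. Note that, for any $s \neq j^*$:

$$\left.\frac{\partial g_{is}(\nu,\lambda)}{\partial \lambda}\right|_{(0,0)} = \frac{U_{is}(x^*)}{|\mathcal{A}_i|} \left.\frac{\partial \zeta_i}{\partial \lambda}\right|_{(0,0)} = \frac{U_{is}(x^*)}{|\mathcal{A}_i|},$$

since the partial derivative of $\zeta_i$ with respect to $\lambda$ when evaluated at $(0,0)$ is 1.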
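Thus,

$$\nabla_\lambda \tilde{g}(\nu,\lambda)\big|_{(0,0)} = \begin{pmatrix} \mathrm{col}\{U_{1s}(x^*)/|\mathcal{A}_1|\}_{s \neq j^*} \\ \vdots \\ \mathrm{col}\{U_{ns}(x^*)/|\mathcal{A}_n|\}_{s \neq j^*} \end{pmatrix}.$$

Again, by the implicit function theorem, we have that

$$\nabla_\lambda \nu^*(\lambda)\big|_{\lambda=0} = -\Big(\nabla_\nu \tilde{g}(\nu,\lambda)\big|_{(0,0)}\Big)^{-1} \cdot \nabla_\lambda \tilde{g}(\nu,\lambda)\big|_{(0,0)},$$

which implies that for any $i \in \mathcal{I}$ and for any $s \neq j^*$

$$\left.\frac{d\nu^*_{is}(\lambda)}{d\lambda}\right|_{\lambda=0} = -\frac{U_{is}(x^*)}{|\mathcal{A}_i|} \cdot \frac{1}{U_{is}(x^*) - U_{ij^*}(x^*)}.$$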


Since $\nu^*(0) = 0$ and $x^*_{is} = 0$, in order for the solution $x = x^* + \nu^*(\lambda)$ to be in the interior $\Delta^o$, we also need the
condition $d\nu^*_{is}(\lambda)/d\lambda|_{\lambda=0} > 0$ to be satisfied for all $s \neq j^*$. Since $U_{is}(x^*) > 0$ by Assumption 3.1, this
condition is equivalent to

$$U_{is}(x^*) - U_{ij^*}(x^*) < 0$$

for all $i \in \mathcal{I}$ and any $s \neq j^*$. This is also equivalent to $x^*$ being a strict Nash equilibrium. Thus, the
conclusion follows.
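The first-order sensitivity formula can be illustrated on a toy case. The sketch below is a simplification under stated assumptions, not the general setting of the proposition: a single agent with two actions, rewards $U_{is}$ and $U_{ij^*}$ that do not depend on the strategy, and a constant perturbation $\zeta_i = \lambda$ (which has zero derivative with respect to $\nu$ and unit derivative with respect to $\lambda$). It locates the stationary point of (23) near the vertex by bisection and compares it with the prediction $\nu^*_{is}(\lambda) \approx -\lambda\, U_{is} / \big(|\mathcal{A}_i| (U_{is} - U_{ij^*})\big)$. The function names are hypothetical.

```python
# Stationary point of the perturbed vector field (23) near a strict Nash vertex.
# Assumptions (illustrative only): two actions, constant rewards, zeta_i = lambda.

def g_s(x_s, lam, U_s, U_j):
    """Entry of (23) for the action s that is not played at the vertex."""
    n_actions = 2
    x_j = 1.0 - x_s
    pert_s = (1.0 - lam) * x_s + lam / n_actions   # perturbed probability of s
    pert_j = (1.0 - lam) * x_j + lam / n_actions   # perturbed probability of j*
    return U_s * pert_s - (U_s * pert_s + U_j * pert_j) * x_s

def bisect_root(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

U_s, U_j = 1.0, 2.0      # U_s - U_j < 0: the vertex is a strict Nash equilibrium
lam = 0.01
root = bisect_root(lambda x: g_s(x, lam, U_s, U_j), 0.0, 0.5)
predicted = -lam * U_s / (2 * (U_s - U_j))
print(root, predicted)
```

In this toy case the stationary point moves into the interior of the simplex (`root > 0`) and agrees with the first-order prediction up to terms of higher order in $\lambda$.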

If $x^*$ corresponds to an action profile which is not a Nash equilibrium, then there exist $i \in \mathcal{I}$ and
$s \neq j^*$ such that $U_{is}(x^*) - U_{ij^*}(x^*) > 0$. For any $\beta \in (0,1)$ which is sufficiently close to one, there exists
$\delta_0 = \delta_0(\beta)$ such that $\zeta_i(x_i, \lambda) = 0$, $i \in \mathcal{I}$, for any $x \in \Delta \setminus B_\delta(x^*)$, $\lambda > 0$ and $\delta > \delta_0$. For any $x \in B_\delta(x^*)$,
$\delta > \delta_0$, the vector field becomes

$$g_{is}(x) \approx [U_{is}(x) - U_{ij^*}(x)]x_{is} + U_{is}(x)\zeta_i(x_i, \lambda)/|\mathcal{A}_i| \tag{24}$$

plus higher order terms of $\lambda$ and $\delta$, for all $s \neq j^*$. Since the Nash condition is violated in the direction of $s$,
$U_{is}(x) - U_{ij^*}(x) = c + O(\delta)$, for some $c > 0$, where $O(\delta)$ denotes a quantity of order of $\delta$. Furthermore,
by Assumption 3.1 of strictly positive rewards, $U_{is}(x) > 0$ for all $s \in \mathcal{A}_i$ and $x \in B_\delta(x^*)$. Therefore, for
any $\delta > \delta_0$ and for sufficiently small $\lambda > 0$, the vector field $g_{is}(x) > 0$ for any $x \in B_\delta(x^*)$, which implies
that there is no stationary point of the vector field in $B_\delta(x^*)$.
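The absence of stationary points near a non-Nash vertex can be seen in the same kind of toy setting. As before, this is only a sketch under simplifying assumptions that are not part of the proof: one agent, two actions, strategy-independent rewards, and a constant perturbation $\zeta_i = \lambda$ in place of the localized perturbation function. The names and numbers are hypothetical.

```python
# Sign of the perturbed vector field (23) near a vertex that violates the Nash condition.
# Assumptions (illustrative only): two actions, constant rewards, zeta_i = lambda.

def g_s(x_s, lam, U_s, U_j):
    """Entry of (23) for the action s that is not played at the vertex."""
    n_actions = 2
    x_j = 1.0 - x_s
    pert_s = (1.0 - lam) * x_s + lam / n_actions
    pert_j = (1.0 - lam) * x_j + lam / n_actions
    return U_s * pert_s - (U_s * pert_s + U_j * pert_j) * x_s

U_s, U_j = 2.0, 1.0       # U_s - U_j > 0: deviating to s is profitable
lam = 0.01
radius = 0.1              # size of the neighborhood examined around the vertex

samples = [radius * k / 100 for k in range(101)]
min_field = min(g_s(x, lam, U_s, U_j) for x in samples)
unperturbed_at_vertex = g_s(0.0, 0.0, U_s, U_j)
print(min_field, unperturbed_at_vertex)
```

Without perturbation the vertex itself is stationary (`unperturbed_at_vertex` is zero), whereas with $\lambda > 0$ the field in the direction of $s$ stays strictly positive over the sampled neighborhood, so no stationary point appears there.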

B  Proof  of  Proposition  6.1 

Let us assume that action profile $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n) \in \mathcal{A}$ has been selected at time $k = 0$. This implies
that $x_{i\alpha_i}(0) > 0$, since actions are selected according to the strategy distribution $\sigma_i(0) = x_i(0)$. The
corresponding payoff profile will be $R(\alpha) = (R_1(\alpha), R_2(\alpha), \ldots, R_n(\alpha))$, where according to Assumption 3.1,
$R_i(\alpha) > 0$ for all $i \in \mathcal{I}$. Let us define the following event:

$$A_\tau = \{\omega \in \Omega : \alpha(k) = \alpha \text{ for all } k \leq \tau\}.$$
Thus, $A_\tau$ corresponds to the case where the same action profile has been performed for all times $k \leq \tau$.
Note that the sequence of events $\{A_\tau\}$ is decreasing, since $A_\tau \supseteq A_{\tau+1}$ for all $\tau = 1, 2, \ldots$. Define also the
event

$$A_\infty = \bigcap_{\tau=1}^{\infty} A_\tau = \{\alpha(\tau) = \alpha,\ \forall \tau\}.$$

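Therefore, from continuity from above, we have:

$$\mathbb{P}[A_\infty] = \lim_{\tau \to \infty} \mathbb{P}[A_\tau] = \lim_{\tau \to \infty} \prod_{k=1}^{\tau} \prod_{i \in \mathcal{I}} x_{i\alpha_i}(k).$$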

The above limit is non-zero if and only if

$$\sum_{k=1}^{\infty} \log(x_{i\alpha_i}(k)) > -\infty \quad \text{for each } i \in \mathcal{I}. \tag{25}$$

Let us define the new variable

$$y_i(k) = 1 - x_{i\alpha_i}(k) = \sum_{j \in \mathcal{A}_i \setminus \alpha_i} x_{ij}(k),$$

which corresponds to the probability of agent $i$ selecting any action other than $\alpha_i$. Condition
(25) is equivalent to

$$-\sum_{k=0}^{\infty} \log(1 - y_i(k)) < \infty, \quad \text{for each } i \in \mathcal{I}. \tag{26}$$

We also have that

$$\lim_{k \to \infty} \frac{-\log(1 - y_i(k))}{y_i(k)} = \lim_{k \to \infty} \frac{1}{1 - y_i(k)} > \rho$$

for some finite $\rho > 0$, since $0 < y_i(k) < 1$. Thus, from the limit comparison test, we conclude that condition
(26) holds, if and only if

$$\sum_{k=1}^{\infty} y_i(k) < \infty, \quad \text{for each } i \in \mathcal{I}.$$

Since $\epsilon(k) = 1/(k^\nu + 1)$, for $1/2 < \nu \leq 1$, we have:

$$\frac{y_i(k+1)}{y_i(k)} = 1 - \frac{R_i(\alpha)}{k^\nu + 1} \leq 1 - \frac{R_i(\alpha)}{k + 1}.$$

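By Raabe's criterion, the series $\sum_{k=0}^{\infty} y_i(k)$ is convergent if

$$\lim_{k \to \infty} k\left(\frac{y_i(k)}{y_i(k+1)} - 1\right) > 1.$$

Since

$$k\left(\frac{y_i(k)}{y_i(k+1)} - 1\right) \geq k\left(\frac{1}{1 - \frac{R_i(\alpha)}{k+1}} - 1\right) = k\,\frac{R_i(\alpha)}{k + 1 - R_i(\alpha)} = \frac{R_i(\alpha)}{1 + \frac{1 - R_i(\alpha)}{k}},$$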
we conclude that the series $\sum_{k=0}^{\infty} y_i(k)$ is convergent if $R_i(\alpha) > 1$ for each $i \in \mathcal{I}$. In other words, the
action profile $\alpha$ will be performed for all future times with positive probability if $R_i(\alpha) > 1$ for all $i \in \mathcal{I}$.
Furthermore, if $R_i(\alpha) > 1$ for all $i \in \mathcal{I}$ and for all $\alpha \in \mathcal{A}$, then the probability that the same action profile
will be played for all future times is uniformly bounded away from zero over all initial conditions.
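The role of the threshold $R_i(\alpha) > 1$ can be illustrated by iterating the recursion $y_i(k+1) = (1 - \epsilon(k) R_i(\alpha))\, y_i(k)$ directly. The sketch below is a hedged illustration for a single agent, with several choices that are assumptions rather than part of the proof: the iteration starts at $k = 1$ with $y_i(1) = 0.5$, the exponent is $\nu = 1$, and the reward is kept below 2 so that $1 - \epsilon(k) R_i(\alpha)$ stays positive. The function name `simulate` is hypothetical.

```python
# Partial sums of y(k), partial products of (1 - y(k)), and Raabe's quantity
# for the recursion y(k+1) = (1 - R / (k**nu + 1)) * y(k).
# Assumptions (illustrative only): single agent, start at k = 1, y(1) = 0.5, nu = 1.

def simulate(R, nu=1.0, y1=0.5, steps=10000):
    y = y1
    partial_sum = 0.0     # approximates sum_k y(k)
    partial_prod = 1.0    # approximates prod_k (1 - y(k)) = prod_k x_{i alpha_i}(k)
    raabe = None          # k * (y(k) / y(k+1) - 1) at the last step
    for k in range(1, steps + 1):
        partial_sum += y
        partial_prod *= 1.0 - y
        y_next = y * (1.0 - R / (k ** nu + 1.0))
        raabe = k * (y / y_next - 1.0)
        y = y_next
    return partial_sum, partial_prod, raabe

sum_hi, prod_hi, raabe_hi = simulate(R=1.5)   # reward above one
sum_lo, prod_lo, raabe_lo = simulate(R=0.5)   # reward below one, for contrast
print(sum_hi, prod_hi, raabe_hi)
print(sum_lo, prod_lo, raabe_lo)
```

With the reward above one, Raabe's quantity settles above 1, the partial sums level off, and the product of the probabilities of repeating the action stays away from zero. With the reward below one (a case the proposition does not cover, shown only for contrast with $\nu = 1$), the partial sums keep growing and the product collapses toward zero.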